Hi Vinoth,

Thank you for proposing this plan. Let's keep the scope to 1 & 2: as part of v1, let's start with point 1, and you guys can tackle point 2 in parallel.
Excited to be a part of this development.

Regards,
Vinay Patil

On Thu, 1 Aug 2019, 21:49 Vinoth Chandar, <[email protected]> wrote:

> Here are my thoughts..
>
> Last time, when Flink was brought up, we dug into the use case and
> realized that having Flink/Beam support for windowing on physical/arrival
> time (_hoodie_commit_time) would be valuable, and that's why Flink was
> being proposed.
>
> I would like to separate two aspects that I feel are intermingled here.
>
> 1) Writing datasets using Flink: today, hoodie-spark-datasource and the
> deltastreamer tool both use Spark to write Hudi datasets. It would be
> nice if we could do this as part of a Flink job as well.
> 2) Querying Hudi datasets using Flink: we can build powerful
> streaming-style pipelines on top of Hudi, since it provides the
> _hoodie_commit_time arrival-time watermarks. Nick and I are trying to
> flesh this out more with motivating use cases and make the case for
> doing this.
>
> Now, a question for the folks driving HUDI-184: is the scope 1, 2, or
> 1 & 2? My suggestion would be to tackle 1 in HUDI-184, and Nick and I can
> tackle 2 in parallel.
>
> This is exciting work :). Hope we can get past the current release and
> jar fixes and get to this.. ha ha.
>
> /thanks/vinoth
>
> On Wed, Jul 31, 2019 at 6:01 AM Semantic Beeng <[email protected]>
> wrote:
>
> > All,
> >
> > @vc and I have been mulling on this for a while and are working on some
> > material to start this.
> >
> > But:
> >
> > 1. We want to start with requirements, right?
> >
> > Last time we discussed this, we asked for use cases, needs, etc.
> >
> > There are some here:
> > https://cwiki.apache.org/confluence/display/HUDI/Hudi+for+Continuous+Deep+Analytics
> >
> > Taher - any news on that example application about trade
> > reconciliation, please?
> >
> > 2. I will also push for us to drive this with proper architecture
> > decisions, to map the choices in a principled way.
> > This will also help users make sense of the fit with their
> > architectures. See https://adr.github.io
> >
> > As an architect, consider that technology-to-technology integrations
> > are a bad idea.
> >
> > This reminds us of the M-to-N (point-to-point) integration problem in
> > enterprise systems.
> >
> > Examples:
> >
> > 1. https://github.com/alibaba/flink-ai-extended/tree/master/flink-ml-tensorflow
> > 2. https://github.com/yahoo/TensorFlowOnSpark
> >
> > Now imagine Hudi hard-linked to Flink.
> >
> > Someone trying to use both Spark and TF for ML, and Flink for data
> > sliding, would be in a tough spot trying to reconcile them.
> >
> > And surely there would be quite a few library version conflicts too.
> >
> > Instead, we need to seek some abstractions in between them to decouple.
> >
> > Hence, the more use cases and design examples you provide, the
> > better. :-)
> >
> > @vc - thoughts?
> >
> > Kind regards
> >
> > Nick
> >
> > On July 31, 2019 at 8:06 AM Vinoth Chandar <[email protected]> wrote:
> >
> > >> First of all, we should agree on the plan.
> >
> > +100. This will be a very involved process.. if we can get a plan
> > agreed upon, then we can start scoping the subtasks..
> >
> > On Wed, Jul 31, 2019 at 2:11 AM Vinay Patil <[email protected]>
> > wrote:
> >
> > Hi Guys,
> >
> > Add me in this as well; I missed out on this last time.
> >
> > Regards,
> > Vinay Patil
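[Editor's note: for readers skimming the thread, point 1 above (writing Hudi datasets with Spark) is driven today by the deltastreamer tool, which is configured through a small properties file. The fragment below is only a hedged sketch: the key names reflect the Hudi utilities of this era and may differ across versions, and all paths and field names are hypothetical placeholders, not from this thread.]

```
# Sketch of a deltastreamer properties file (illustrative; verify key names
# against the Hudi version in use — all paths/fields below are hypothetical).
hoodie.datasource.write.recordkey.field=trade_id
hoodie.datasource.write.partitionpath.field=trade_date
hoodie.deltastreamer.source.dfs.root=hdfs:///data/raw/trades
hoodie.deltastreamer.schemaprovider.source.schema.file=hdfs:///schemas/trade.avsc
```

A Flink-based writer, as proposed in point 1, would need an equivalent, engine-neutral way to supply this kind of configuration, which is part of why decoupling the write path from Spark is the crux of HUDI-184.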
