Re: [DISCUSS] Integrate Hudi with Apache Flink

Vinoth Chandar Thu, 01 Aug 2019 09:20:17 -0700

Here are my thoughts.

Last time, when Flink was brought up, we dug into the use-case and realized
that having Flink/Beam support for windowing on physical/arrival time
(_hoodie_commit_time) would be valuable, and that's why Flink was being
proposed.


I would like to separate two aspects that I feel are intermingled here.

1) Writing datasets using Flink: Today the hoodie-spark-datasource and the
deltastreamer tool both use Spark to write Hudi datasets. It would be nice
if we could do this as part of a Flink job as well.
2) Querying Hudi datasets using Flink: We can build awesome streaming-style
pipelines on top of Hudi, since it provides the _hoodie_commit_time arrival
time watermarks. Nick & I are trying to flesh this out more with
motivating use-cases and make the case for doing this.
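
To make aspect 2 a bit more concrete, here is a minimal sketch of windowing
on arrival time, written in plain Python rather than against any Flink or
Hudi API. It assumes _hoodie_commit_time is a yyyyMMddHHmmss string; the
helper names (window_start, tumbling_windows) and the sample records are
hypothetical and only for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: commit times are strings formatted as yyyyMMddHHmmss.
COMMIT_TIME_FORMAT = "%Y%m%d%H%M%S"
EPOCH = datetime(1970, 1, 1)


def window_start(commit_time: str, size: timedelta) -> datetime:
    """Return the start of the tumbling window containing commit_time."""
    ts = datetime.strptime(commit_time, COMMIT_TIME_FORMAT)
    # Number of whole windows elapsed since the epoch (timedelta // timedelta -> int).
    index = (ts - EPOCH) // size
    return EPOCH + index * size


def tumbling_windows(records, size: timedelta):
    """Group records into fixed-size windows keyed by window start time."""
    windows = defaultdict(list)
    for record in records:
        start = window_start(record["_hoodie_commit_time"], size)
        windows[start].append(record)
    return dict(windows)


# Hypothetical records; only the _hoodie_commit_time field name comes from Hudi.
records = [
    {"_hoodie_commit_time": "20190801091500", "key": "a"},
    {"_hoodie_commit_time": "20190801094500", "key": "b"},
    {"_hoodie_commit_time": "20190801101000", "key": "c"},
]

by_hour = tumbling_windows(records, timedelta(hours=1))
for start, group in sorted(by_hour.items()):
    print(start.isoformat(), [r["key"] for r in group])
```

A real streaming job would also need a watermark strategy to decide when a
window is complete; this sketch only shows the grouping step on a finite
batch of records.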


Now a question for the folks driving HUDI-184: is the scope 1, 2, or both?
My suggestion would be to tackle 1 in HUDI-184, and Nick and I can tackle 2
in parallel.

This is exciting work :). Hope we can get past the current release and jar
fixes and get to this. Ha ha.

/thanks/vinoth






On Wed, Jul 31, 2019 at 6:01 AM Semantic Beeng <[email protected]>
wrote:

> All,
>
> @vc and I have been mulling on this for a while and are working on some
> material to start this.
>
> But
>
> 1. We want to start with requirements, right?
>
> Last time we discussed this we asked for use cases, needs etc.
>
> Have some here
> https://cwiki.apache.org/confluence/display/HUDI/Hudi+for+Continuous+Deep+Analytics
> .
>
> Taher - any news on that example application about trade reconciliation,
> please?
>
> 2. Will push that we also drive this with proper architecture decisions to
> map the choices in a principled way.
>
> This will also help users make sense of fit with their architectures. See
> https://adr.github.io
>
> As an architect, I consider technology-to-technology integrations a bad
> idea.
>
> They remind us of M-to-N (point-to-point) integration in enterprise
> systems.
>
> Examples
>
> 1.
> https://github.com/alibaba/flink-ai-extended/tree/master/flink-ml-tensorflow
>
> 2. https://github.com/yahoo/TensorFlowOnSpark
>
> And now imagine Hudi hard linked to Flink.
>
> Someone trying to use both Spark and TF for ML and Flink for data
> sliding would be in a tough spot trying to reconcile them.
>
> And surely quite a few library version conflicts too.
>
> Instead we need to seek some abstractions in between them to decouple.
>
> Hence, the more use cases and design examples you provide the better. :-)
>
> @vc - thoughts?
>
> Kind regards
>
> Nick
>
>
>
>
>
> On July 31, 2019 at 8:06 AM Vinoth Chandar <[email protected]> wrote:
>
>
> >>First of all, we should agree on the plan.
> +100. This will be a very involved process. If we can get a plan agreed
> upon, then we can start scoping the subtasks.
>
> On Wed, Jul 31, 2019 at 2:11 AM Vinay Patil <[email protected]>
> wrote:
>
> Hi Guys,
>
> Add me in this as well, missed out on this last time.
>
> Regards,
> Vinay Patil
>
>
