In a nutshell, what I'm talking about is having a DeltaStreamer written in Flink (or Beam) that can use the engine extensively, and that can be coupled with user code to make it extensible and pleasant to use.
Let me give you a short example.

1. Typical Flink pipeline:

A. Suppose we have two topics in Kafka, T1 and T2, consisting of messages that can be joined with each other on a certain key, say "id", and on the timestamps embedded in the messages of both topics. This is just event-time-based stream processing in Flink.

B. Flink matches the two streams with a checkpointing interval of, let's say, 10 minutes. A checkpoint tells Flink to snapshot its current state and to flush the data it was holding in memory to the DFS. In the recent data sinks of Flink, data is written upon checkpointing (see the class StreamingFileSink in Flink 1.7.2). A minimal sketch of this matching pipeline is at the end of this thread.

C. Derive Hive tables further on the matched records written by Flink, say "tab1". Use another table, say "tab2", which gets data from some XYZ source and is used to enrich our matched table.

D. Some other Flink pipeline then listens to the directory where all the enriched data is being written and publishes it again to some other Kafka topic, or to another application on port 8080, or feeds an ML model, etc.

Problems:

1. We need two Flink pipelines.

2. We don't know when tab2, which enriches tab1, will get new data, so the enrichment batch has to be rerun again and again whenever new data arrives for the enrichment table.

3. Since the enrichment job may need to be rerun whenever new enrichment data arrives, the second Flink pipeline needs to know whether it has seen a given record in the past and whether that record needs to be sent further.

4. Another Flink pipeline has to be managed separately.

Solution (with Hudi):

1. Write data from the first Flink pipeline in the form of HoodieInputFormat.

2. When it comes to joining the tab1 and tab2 tables, since both tables are now in HoodieInputFormat, I can freely choose between Spark and Presto for my joins, whichever is faster, and the join will be done incrementally.

3. I don't need to write dedup logic in my second Flink pipeline. DeltaStreamer should take care of that and give me only the records that were updated or inserted, which I can send back to Kafka or to the application on port 8080. A sketch of this incremental pull is also at the end of this thread.

4. Also, did you notice how significant the time and code savings were here, in such a simple use case? Plus, as a user I was free to choose the Spark or Presto layer, or even a Flink layer, when joining tab1 and tab2 for enrichment, in seconds.

Hope this example makes sense to you guys.

Regards,
Taher Koitawala

On Tue, 19 Mar, 2019, 11:47 PM Taher Koitawala, <[email protected]> wrote:

> Hi @Semantic Beeng, I am keen on getting Flink into Hudi, as I personally
> see a lot of good things we can do with it. I am currently preparing a
> small Google doc with a sample use case which will help us understand why
> we think Flink should be included.
>
> As for the effort, I'm willing to work on this, and we can proceed as you
> say, as I have fairly good knowledge of Flink.
>
> Regards,
> Taher Koitawala
>
> On Tue, 19 Mar, 2019, 11:27 PM Vinoth Chandar, <[email protected]> wrote:
>
>> Yeah, this has come up within our company as well...
>>
>> On Mon, Mar 18, 2019 at 7:28 PM Pingle Wang <[email protected]> wrote:
>>
>> > I am very happy to see everyone discussing this topic. There are still
>> > many companies using Flink; if Hudi can support Flink, it will attract
>> > that part of the user base and further develop our Hudi project.
>> >
>> > Thanks.
>> > ------------------ Original message ------------------
>> > From: "Vinoth Chandar" <[email protected]>
>> > Sent: Tuesday, March 19, 2019, 3:45 AM
>> > To: "Semantic Beeng" <[email protected]>
>> > Cc: "dev" <[email protected]>
>> > Subject: Re: Flink With Hudi
>> >
>> > Sorry for the late reply. Busy Sunday :)
>> >
>> > First off, this is a very interesting topic.
>> >
>> > @Semantic Beeng <[email protected]>, +1. It would be good to lay out
>> > the current use cases not met by Spark execution, specifically related
>> > to Hudi's write path.
>> >
>> > @taher, Flink definitely has its advantages, as you mention. Two cases
>> > where I thought direct Flink support for writing datasets would be
>> > good are:
>> >
>> > 1) Capture the result of Flink jobs and write it out as a Hudi dataset.
>> > But one can always write a Flink job to compute the results and then
>> > store them in Hudi, using it as a sink? Something like:
>> > Kafka => Flink => Kafka => DeltaStreamer => Hudi on DFS
>> > 2) If someone is not using Spark at all, then Hudi brings it in and
>> > potentially increases ops costs?
>> >
>> > On Beam, while it's admittedly new, if we were to abstract away Spark
>> > and Flink from Hudi code (once again, a non-trivial amount of work ;)),
>> > then Beam is very attractive. We will end up inventing a Beam--
>> > otherwise anyway :)
>> >
>> > Thanks
>> > Vinoth
>> >
>> > On Sun, Mar 17, 2019 at 10:22 PM Semantic Beeng <[email protected]>
>> > wrote:
>> >
>> > > Hello Taher,
>> > >
>> > > Vinoth, I was looking for such an assessment from you - thanks. :-)
>> > >
>> > > Taher - at a high level it sounds interesting to explore some
>> > > Flink-specific or common use cases, I think.
>> > >
>> > > We are having discussions about integrating Hudi with Beam, so if you
>> > > can relate your use cases to Flink + Beam, that would be interesting.
>> > >
>> > > I can imagine useful scenarios where Spark-based analytics would be
>> > > combined with Flink-based analytics going through Parquet.
>> > >
>> > > But I do not know Flink well enough to see where Hudi-like
>> > > functionality would fit.
>> > >
>> > > Could you provide such use cases (all kinds above and others)?
>> > > Ideally with code references.
>> > >
>> > > Depending on how serious your interest is, we can go deeper in the
>> > > wiki.
>> > >
>> > > Thanks
>> > >
>> > > Nick
>> > >
>> > > On March 17, 2019 at 3:34 AM Vinoth Chandar <[email protected]> wrote:
>> > >
>> > > Hi Taher,
>> > >
>> > > Thanks for kicking off this thread. We can use this thread itself to
>> > > discuss Flink. Hudi uses Spark today on the writing side, and the
>> > > micro-batch model actually fits very well. Given cloud stores don't
>> > > support appends anyway, we would end up micro-batching nonetheless,
>> > > even with Flink. Abstracting out Spark would be a large effort (which
>> > > gets me thinking whether we should then just rewrite on top of Beam
>> > > ;)), and I have not thought of any unique advantages we get for Hudi
>> > > by adding Flink. Do you have something in mind?
>> > >
>> > > If you can expand on where the gaps are with only having Spark/Hudi,
>> > > that'd be really educative.
>> > >
>> > > Thanks
>> > > Vinoth
>> > >
>> > > On Sat, Mar 16, 2019 at 11:17 PM Taher Koitawala
>> > > <[email protected]> wrote:
>> > >
>> > > Hi Prasanna,
>> > > Thank you for your reply. Should we start a discussion or open a JIRA
>> > > in this regard then?
>> > > On Sun, 17 Mar, 2019, 11:36 AM Prasanna, <[email protected]> wrote:
>> > >
>> > > Hello,
>> > >
>> > > I don't know of any effort to write Hudi with Flink.
>> > >
>> > > - Prasanna
>> > >
>> > > On Sat, Mar 16, 2019 at 10:44 PM Taher Koitawala
>> > > <[email protected]> wrote:
>> > >
>> > > Hey guys, any inputs on this?
>> > >
>> > > On Sat, 16 Mar, 2019, 12:35 PM Taher Koitawala,
>> > > <[email protected]> wrote:
>> > >
>> > > Hi guys, I have recently been exploring Hudi. It manages to solve a
>> > > lot of our current use cases; however, my question is: can I use
>> > > Flink with Hudi? So far I have only seen Spark integration with Hudi.
>> > >
>> > > Flink is more of a real-time processing engine than a near-real-time
>> > > one, with rich features like checkpointing for fault tolerance, state
>> > > for in-stream computations, better windowing capabilities, very high
>> > > stream throughput, and exactly-once semantics from source to sink.
>> > > Flink is capable of being a part of Hudi to solve our in-stream use
>> > > cases.
>> > >
>> > > Regards,
>> > > Taher Koitawala
>> > > GS Lab Pune
>> > > +91 8407979163
>> --
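To make steps A and B of the example concrete, here is a minimal sketch of the matching pipeline, assuming Flink 1.7.2, a local Kafka broker, and CSV-encoded messages of the form id,eventTimeMillis,payload. The message layout, broker address, paths, and class names are illustrative assumptions, not part of the original pipeline.

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class MatchT1T2 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        // 10-minute checkpoint interval; StreamingFileSink commits files on checkpoint.
        env.enableCheckpointing(10 * 60 * 1000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker
        props.setProperty("group.id", "t1-t2-matcher");

        // Read both topics and extract the event time embedded in each message.
        DataStream<String> t1 = env
            .addSource(new FlinkKafkaConsumer<>("T1", new SimpleStringSchema(), props))
            .assignTimestampsAndWatermarks(new CsvTimestampExtractor());
        DataStream<String> t2 = env
            .addSource(new FlinkKafkaConsumer<>("T2", new SimpleStringSchema(), props))
            .assignTimestampsAndWatermarks(new CsvTimestampExtractor());

        // Event-time window join on "id".
        DataStream<String> matched = t1.join(t2)
            .where(new IdSelector())
            .equalTo(new IdSelector())
            .window(TumblingEventTimeWindows.of(Time.minutes(5)))
            .apply(new JoinFunction<String, String, String>() {
                @Override
                public String join(String left, String right) {
                    return left + "|" + right;
                }
            });

        // Matched records become visible on the DFS when a checkpoint completes.
        matched.addSink(StreamingFileSink
            .forRowFormat(new Path("hdfs:///data/tab1"), new SimpleStringEncoder<String>("UTF-8"))
            .build());

        env.execute("match-t1-t2");
    }

    private static class IdSelector implements KeySelector<String, String> {
        @Override
        public String getKey(String line) {
            return line.split(",")[0]; // the "id" field
        }
    }

    private static class CsvTimestampExtractor extends BoundedOutOfOrdernessTimestampExtractor<String> {
        CsvTimestampExtractor() {
            super(Time.seconds(30)); // tolerate 30s of out-of-orderness
        }
        @Override
        public long extractTimestamp(String line) {
            return Long.parseLong(line.split(",")[1]); // embedded event time
        }
    }
}

Because StreamingFileSink only finalizes files on successful checkpoints, the 10-minute checkpoint interval directly controls how often matched records land on the DFS.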

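And a sketch of point 3 of the Hudi solution: instead of hand-written dedup logic, the downstream job asks Hudi's incremental view for only the records of tab1 that were inserted or updated after the last commit it processed. The option keys below follow current Hudi releases (the 2019-era releases lived under the com.uber.hoodie package and used slightly different keys); the paths, instant time, join key, and table names are hypothetical.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalEnrichment {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("tab1-incremental-enrichment")
            .getOrCreate();

        // Hypothetical: the last Hudi commit instant this job has already
        // processed; a real job would persist this between runs.
        String lastSeenInstant = "20190320010101";

        // Incremental view: only records of tab1 written after lastSeenInstant.
        Dataset<Row> changed = spark.read()
            .format("org.apache.hudi") // "com.uber.hoodie" in 2019-era releases
            .option("hoodie.datasource.query.type", "incremental")
            .option("hoodie.datasource.read.begin.instanttime", lastSeenInstant)
            .load("hdfs:///data/tab1"); // illustrative path

        // Enrichment side (tab2); plain parquet here for simplicity.
        Dataset<Row> tab2 = spark.read().parquet("hdfs:///data/tab2");

        // Join only the incremental slice of tab1 against tab2, then upsert
        // the enriched records into another Hudi table for downstream use
        // (a Kafka publish, the port-8080 app, an ML model, etc.).
        changed.join(tab2, "id")
            .write()
            .format("org.apache.hudi")
            .option("hoodie.datasource.write.recordkey.field", "id")
            .option("hoodie.datasource.write.precombine.field", "event_time") // assumed ordering field
            .option("hoodie.table.name", "tab1_enriched")
            .mode("append")
            .save("hdfs:///data/tab1_enriched");

        spark.stop();
    }
}

The job only needs to persist the last instant time it consumed between runs; everything before that instant is skipped, so reruns touch only the incremental slice rather than the whole table.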