Sorry for the late reply. Busy Sunday :) First off, this is a very interesting topic.
@Semantic Beeng <[email protected]>, +1. It would be good to lay out the current use cases not met by Spark execution, specifically related to Hudi's write path.

@Taher, Flink definitely has its advantages, as you mention. Two cases where I thought direct Flink support for writing datasets would be good are:

1) Capturing the result of Flink jobs and writing it out as a Hudi dataset. But one can always write a Flink job to compute the results and then store them in Hudi, using it as a sink? Something like: Kafka => Flink => Kafka => DeltaStreamer => Hudi on dfs

2) If someone is not using Spark at all, then Hudi brings it in and potentially increases ops costs?

On Beam: while it's admittedly new, if we were to abstract away Spark and Flink from Hudi code (once again, a non-trivial amount of work ;)), then Beam is very attractive. We will end up inventing a Beam-- otherwise anyway :)

Thanks
Vinoth

On Sun, Mar 17, 2019 at 10:22 PM Semantic Beeng <[email protected]> wrote:

> Hello Taher,
>
> Vinoth, I was looking for such an assessment from you - thanks. :-)
>
> Taher - at a high level it sounds interesting to explore some
> Flink-specific or common use cases, I think.
>
> We are having discussions about integrating Hudi with Beam, so if you can
> relate your use cases to Flink + Beam it would be interesting.
>
> I can imagine useful scenarios where Spark-based analytics would be
> combined with Flink-based analytics going through Parquet.
>
> But I do not know Flink well enough to see where Hudi-like functionality
> would fit.
>
> Could you provide such use cases (all the kinds above, and others)?
> Ideally with code references.
>
> Depending on how serious your interest is, we can go deeper in the wiki.
>
> Thanks
>
> Nick
>
> On March 17, 2019 at 3:34 AM Vinoth Chandar <[email protected]> wrote:
>
> > Hi Taher,
> >
> > Thanks for kicking off this thread. We can use this thread itself to
> > discuss Flink. Hudi uses Spark today on the writing side, and the
> > micro-batch model actually fits very well.
> > Given that cloud stores don't support appends anyway, we would end up
> > micro-batching nonetheless, even with Flink. Abstracting out Spark would
> > be a large effort (which gets me thinking: should we then just rewrite
> > on top of Beam? ;)), and I have not thought of any unique advantages we
> > get for Hudi by adding Flink. Do you have something in mind?
> >
> > If you can expand on where the gaps are with only having Spark/Hudi,
> > that'd be really educative.
> >
> > Thanks
> > Vinoth
> >
> > On Sat, Mar 16, 2019 at 11:17 PM Taher Koitawala <[email protected]> wrote:
> >
> > > Hi Prasanna,
> > > Thank you for your reply. Should we start a discussion or open a JIRA
> > > in this regard then?
> > >
> > > On Sun, 17 Mar, 2019, 11:36 AM Prasanna <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I don't know of any effort to write to Hudi with Flink.
> > > >
> > > > - Prasanna
> > > >
> > > > On Sat, Mar 16, 2019 at 10:44 PM Taher Koitawala <[email protected]> wrote:
> > > >
> > > > > Hey guys, any inputs on this?
> > > > >
> > > > > On Sat, 16 Mar, 2019, 12:35 PM Taher Koitawala <[email protected]> wrote:
> > > > >
> > > > > > Hi guys, I have recently been exploring Hudi. It manages to
> > > > > > solve a lot of our current use cases; however, my question is:
> > > > > > can I use Flink with Hudi? So far I have only seen Spark
> > > > > > integration with Hudi.
> > > > > >
> > > > > > Flink is more of a real-time processing engine than a
> > > > > > near-real-time one, and with its rich features - checkpointing
> > > > > > for fault tolerance, state for in-stream computations, better
> > > > > > windowing capabilities, very high stream throughput, and
> > > > > > exactly-once semantics from source to sink - Flink is capable
> > > > > > of being a part of Hudi to solve our in-stream use cases.
> > > > > >
> > > > > > Regards,
> > > > > > Taher Koitawala
> > > > > > GS Lab Pune
> > > > > > +91 8407979163
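[Editor's note for readers of the archive: the "Kafka => Flink => Kafka => DeltaStreamer => Hudi on dfs" hand-off Vinoth sketches boils down to pointing DeltaStreamer at the Kafka topic the Flink job writes to. A minimal sketch of that launch follows; the topic, paths, table name, and properties file are hypothetical, and exact class and flag names vary across Hudi releases, so check the utilities bundle you actually run.]

```shell
# Sketch only: consume the topic produced by the Flink job and upsert it
# into a Hudi dataset on DFS. Paths, table name, ordering field, and the
# properties file are illustrative; flag and package names differ between
# Hudi versions.
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path hdfs:///data/hudi/flink_results \
  --target-table flink_results \
  --props kafka-source.properties
```

[Here `kafka-source.properties` would carry the Kafka bootstrap servers and the source topic the Flink job produces to; DeltaStreamer then handles the micro-batched writes discussed above, so no hand-written Spark job is needed.]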
