On Beam, I tested it over a year ago and Spark batch/Flink streaming under Beam worked well IMO. Our mentors both know this area very well. Maybe they can chime in as well? :)
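To make the "common denominator" point concrete, here is a rough sketch (Python, with made-up paths; only illustrative) of what a runner-agnostic Beam pipeline looks like. Only the --runner flag would change between Spark and Flink:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # The engine is picked at submission time (--runner=SparkRunner or
    # --runner=FlinkRunner); the pipeline code itself does not change.
    opts = PipelineOptions(["--runner=SparkRunner"])

    with beam.Pipeline(options=opts) as p:
        (p
         | "Read" >> beam.io.ReadFromText("hdfs:///tmp/events/*")   # made-up path
         | "Count" >> beam.combiners.Count.PerElement()
         | "Format" >> beam.Map(lambda kv: "%s\t%d" % kv)
         | "Write" >> beam.io.WriteToText("hdfs:///tmp/counts"))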
I think I understand Arrow fairly well. It can read Parquet data into memory efficiently using zero-copy etc., and then lay it out optimized for vectorization by queries. Even if you read/write Parquet on disk, the bottleneck is often CPU and not really I/O. Arrow helps with all of this. We have thought about building a mutable cache using Arrow before, to improve real-time view performance. Something like:

Source => Hudi DeltaStreamer => Hudi Dataset on DFS => Incremental Pull => Arrow cache

Hudi works with a DFS abstraction (HDFS, S3, igfs, etc.) and manages upserts on top of them. Are you suggesting Hudi work with Arrow buffers directly?
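For reference, the raw mechanics of what you describe below (a writer and a CPython reader sharing the same off-heap buffers) can be sketched with pyarrow alone. File and column names here are made up, and this deliberately ignores Hudi's own file management:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Read a Parquet file into an Arrow Table (columnar layout, friendly
    # to vectorized execution). File and column names are made up.
    table = pq.read_table("part-00000.parquet", columns=["key", "value"])

    # Persist the record batches in Arrow IPC format so another process
    # can map them without deserialization.
    with pa.OSFile("arrow_cache.arrow", "wb") as sink:
        with pa.RecordBatchFileWriter(sink, table.schema) as writer:
            writer.write_table(table)

    # A reader (CPython here, but the JVM Arrow library understands the
    # same IPC format) memory-maps the file, sharing buffers rather than
    # copying them.
    source = pa.memory_map("arrow_cache.arrow", "r")
    cached = pa.RecordBatchFileReader(source).read_all()

The open question would be how to invalidate/refresh such a cache as new commits land, which is what the incremental pull step in the flow above would drive.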
Thanks
Vinoth

On Mon, Mar 18, 2019 at 1:35 PM Semantic Beeng <[email protected]> wrote:

> @vc I also see Beam working well as a common denominator between Spark and Flink.
>
> Is there any thorough coverage of how Beam works with Spark and Flink? We could use that as a source of usage examples.
>
> I know of Scio and Featran but not well enough.
>
> https://github.com/spotify/scio
>
> https://github.com/spotify/scio/tree/master/scio-examples/src/main/scala/com/spotify/scio/examples/cookbook
>
> @vc an idea triggered by your 1) below.
>
> If one (me) is interested in composing "analyses" closely (without writing to disk or Kafka), then would it make sense to "write to Parquet" in memory through Apache Arrow?
>
> https://arrow.apache.org/docs/python/parquet.html
>
> "Apache Arrow is an ideal in-memory transport layer for data that is being read or written with Parquet files."
>
> https://arrow.apache.org/docs/format/IPC.html
>
> Is this possibly a good idea?
>
> I can see how this might not make sense because Hudi maintains a sophisticated handle on the actual files. (mulling)
>
> But maybe there is a way to "stream" the updates also over Arrow somehow (maybe we can have some abstractions over how updates are "streamed" in, or relate this to the DeltaStreamer somehow).
>
> If you think it has value then I will elaborate a bit more as a use case in the wiki.
>
> The nice thing is that Arrow gives a shared runtime: one can have f1 in the JVM writing and f2 in CPython reading while sharing the same off-heap/native memory.
>
> Thoughts?
>
> Any sense of what it would take to abstract at this level in Hudi?
>
> Cheers
>
> Nick
>
>
> On March 18, 2019 at 3:45 PM Vinoth Chandar <[email protected]> wrote:
>
> Sorry for the late reply. Busy Sunday :)
>
> First off, this is a very interesting topic.
>
> @Semantic Beeng <[email protected]>, +1. It would be good to lay out the current use cases not met by Spark execution, specifically related to Hudi's write path.
>
> @taher, definitely, Flink has its advantages like you mention. Two cases where I thought direct Flink support for writing datasets would be good are:
>
> 1) Capture the result of Flink jobs and write it out as a Hudi dataset. But one can always write a Flink job to compute the results and then store them in Hudi, using it as a sink? Something like: Kafka => Flink => Kafka => DeltaStreamer => Hudi on DFS
>
> 2) If someone is not using Spark at all, then Hudi brings it in and potentially increases ops costs?
>
> On Beam, while it's admittedly new, if we were to abstract away Spark and Flink from Hudi code (once again, a non-trivial amount of work ;)), then Beam is very attractive. We will end up inventing a Beam-- otherwise anyway :)
>
> Thanks
> Vinoth
>
> On Sun, Mar 17, 2019 at 10:22 PM Semantic Beeng <[email protected]> wrote:
>
> Hello Taher,
>
> Vinoth, I was looking for such an assessment from you - thanks. :-)
>
> Taher - at a high level it sounds interesting to explore some Flink-specific or common use cases, I think.
>
> We have been discussing integrating Hudi with Beam, so if you can relate your use cases to Flink + Beam it would be interesting.
>
> I can imagine useful scenarios where Spark-based analytics would be combined with Flink-based analytics going through Parquet.
>
> But I do not know Flink well enough to see where Hudi-like functionality would fit.
>
> Could you provide such use cases (all kinds above and others)? Ideally with code references.
>
> Depending on how serious your interest is, we can go deeper in the wiki.
>
> Thanks
>
> Nick
>
>
> On March 17, 2019 at 3:34 AM Vinoth Chandar <[email protected]> wrote:
>
> Hi Taher,
>
> Thanks for kicking off this thread. We can use this itself to discuss Flink. Hudi uses Spark today on the writing side, and the micro-batch model actually fits very well. Given cloud stores don't support appends anyway, we would end up micro-batching nonetheless, even with Flink. Abstracting out Spark would be a large effort (gets me thinking whether we should then just rewrite on top of Beam ;)) and I have not thought of any unique advantages we get for Hudi by adding Flink. Do you have something in mind?
>
> If you can expand on where the gaps are with only having Spark/Hudi, that'd be really educative.
>
> Thanks
> Vinoth
>
> On Sat, Mar 16, 2019 at 11:17 PM Taher Koitawala <[email protected]> wrote:
>
> Hi Prasanna,
> Thank you for your reply. Should we start a discussion or open a JIRA in this regard then?
>
> On Sun, 17 Mar, 2019, 11:36 AM Prasanna <[email protected]> wrote:
>
> Hello,
>
> I don't know of any effort to write Hudi with Flink.
>
> - Prasanna
>
> On Sat, Mar 16, 2019 at 10:44 PM Taher Koitawala <[email protected]> wrote:
>
> Hey guys, any inputs on this?
>
> On Sat, 16 Mar, 2019, 12:35 PM Taher Koitawala <[email protected]> wrote:
>
> Hi guys, I have recently been exploring Hudi. It manages to solve a lot of our current use cases; however, my question is: can I use Flink with Hudi? So far I have only seen Spark integration with Hudi.
>
> Flink is more of a real-time processing engine than a near-real-time one, with rich features like checkpointing for fault tolerance, state for in-stream computations, better windowing capabilities, very high stream throughput, and exactly-once semantics from source to sink. Flink is capable of being a part of Hudi to solve our in-stream use cases.
>
> Regards,
> Taher Koitawala
> GS Lab Pune
> +91 8407979163
