On Beam, I tested it over a year ago and Spark batch/Flink streaming under Beam worked well IMO. Our mentors both know this area very well. Maybe they can chime in as well? :)
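To make the "common denominator" point concrete, here is a rough sketch (Python, with made-up paths; only illustrative) of what a runner-agnostic Beam pipeline looks like. Only the --runner flag would change between Spark and Flink:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # The engine is picked at submission time (--runner=SparkRunner or
    # --runner=FlinkRunner); the pipeline code itself does not change.
    opts = PipelineOptions(["--runner=SparkRunner"])

    with beam.Pipeline(options=opts) as p:
        (p
         | "Read" >> beam.io.ReadFromText("hdfs:///tmp/events/*")   # made-up path
         | "Count" >> beam.combiners.Count.PerElement()
         | "Format" >> beam.Map(lambda kv: "%s\t%d" % kv)
         | "Write" >> beam.io.WriteToText("hdfs:///tmp/counts"))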
I think I understand Arrow fairly well. It can read Parquet data into memory efficiently using zero-copy etc., and then lay it out optimized for vectorization by queries. Even if you read/write Parquet on disk, the bottleneck is often CPU and not really I/O. Arrow helps with all of this. We have thought about building a mutable cache using Arrow before, to improve real-time view performance. Something like:

Source => Hudi DeltaStreamer => Hudi Dataset on DFS => Incremental Pull => Arrow cache

Hudi works with a DFS abstraction (HDFS, S3, igfs, etc.) and manages upserts on top of them. Are you suggesting Hudi work with Arrow buffers directly?
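For reference, the raw mechanics of what you describe below (a writer and a CPython reader sharing the same off-heap buffers) can be sketched with pyarrow alone. File and column names here are made up, and this deliberately ignores Hudi's own file management:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Read a Parquet file into an Arrow Table (columnar layout, friendly
    # to vectorized execution). File and column names are made up.
    table = pq.read_table("part-00000.parquet", columns=["key", "value"])

    # Persist the record batches in Arrow IPC format so another process
    # can map them without deserialization.
    with pa.OSFile("arrow_cache.arrow", "wb") as sink:
        with pa.RecordBatchFileWriter(sink, table.schema) as writer:
            writer.write_table(table)

    # A reader (CPython here, but the JVM Arrow library understands the
    # same IPC format) memory-maps the file, sharing buffers rather than
    # copying them.
    source = pa.memory_map("arrow_cache.arrow", "r")
    cached = pa.RecordBatchFileReader(source).read_all()

The open question would be how to invalidate/refresh such a cache as new commits land, which is what the incremental pull step in the flow above would drive.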
Thanks
Vinoth

On Mon, Mar 18, 2019 at 1:35 PM Semantic Beeng <[email protected]> wrote:

> @vc I also see Beam working well as a common denominator between Spark and Flink.
>
> Is there any thorough coverage of how Beam works with Spark and Flink? We could use that as a source of usage examples.
>
> I know of Scio and Featran but not well enough.
>
> https://github.com/spotify/scio
>
> https://github.com/spotify/scio/tree/master/scio-examples/src/main/scala/com/spotify/scio/examples/cookbook
>
> @vc an idea triggered by your 1) below.
>
> If one (me) is interested in composing "analyses" closely (without writing to disk or Kafka), then would it make sense to "write to Parquet" in memory through Apache Arrow?
>
> https://arrow.apache.org/docs/python/parquet.html
>
> "Apache Arrow is an ideal in-memory transport layer for data that is being read or written with Parquet files."
>
> https://arrow.apache.org/docs/format/IPC.html
>
> Is this possibly a good idea?
>
> I can see how this might not make sense because Hudi maintains a sophisticated handle on the actual files. (mulling)
>
> But maybe there is a way to "stream" the updates also over Arrow somehow (maybe we can have some abstractions over how updates are "streamed" in, or relate this to the DeltaStreamer somehow).
>
> If you think it has value then I will elaborate a bit more as a use case in the wiki.
>
> The nice thing is that Arrow gives a shared runtime: one can have f1 in the JVM writing and f2 in CPython reading while sharing the same off-heap/native memory.
>
> Thoughts?
>
> Any sense of what it would take to abstract at this level in Hudi?
>
> Cheers
>
> Nick
>
>
> On March 18, 2019 at 3:45 PM Vinoth Chandar <[email protected]> wrote:
>
> Sorry for the late reply. Busy Sunday :)
>
> First off, this is a very interesting topic.
>
> @Semantic Beeng <[email protected]>, +1. It would be good to lay out the current use cases not met by Spark execution, specifically related to Hudi's write path.
>
> @taher, definitely, Flink has its advantages like you mention. Two cases where I thought direct Flink support for writing datasets would be good are:
>
> 1) Capture the result of Flink jobs and write it out as a Hudi dataset. But one can always write a Flink job to compute the results and then store them in Hudi, using it as a sink? Something like: Kafka => Flink => Kafka => DeltaStreamer => Hudi on DFS
>
> 2) If someone is not using Spark at all, then Hudi brings it in and potentially increases ops costs?
>
> On Beam, while it's admittedly new, if we were to abstract away Spark and Flink from Hudi code (once again, a non-trivial amount of work ;)), then Beam is very attractive. We will end up inventing a Beam-- otherwise anyway :)
>
> Thanks
> Vinoth
>
> On Sun, Mar 17, 2019 at 10:22 PM Semantic Beeng <[email protected]> wrote:
>
> Hello Taher,
>
> Vinoth, I was looking for such an assessment from you - thanks. :-)
>
> Taher - at a high level it sounds interesting to explore some Flink-specific or common use cases, I think.
>
> We have been discussing integrating Hudi with Beam, so if you can relate your use cases to Flink + Beam it would be interesting.
>
> I can imagine useful scenarios where Spark-based analytics would be combined with Flink-based analytics going through Parquet.
>
> But I do not know Flink well enough to see where Hudi-like functionality would fit.
>
> Could you provide such use cases (all kinds above and others)? Ideally with code references.
>
> Depending on how serious your interest is, we can go deeper in the wiki.
>
> Thanks
>
> Nick
>
>
> On March 17, 2019 at 3:34 AM Vinoth Chandar <[email protected]> wrote:
>
> Hi Taher,
>
> Thanks for kicking off this thread. We can use this itself to discuss Flink. Hudi uses Spark today on the writing side, and the micro-batch model actually fits very well. Given cloud stores don't support appends anyway, we would end up micro-batching nonetheless, even with Flink. Abstracting out Spark would be a large effort (gets me thinking whether we should then just rewrite on top of Beam ;)) and I have not thought of any unique advantages we get for Hudi by adding Flink. Do you have something in mind?
>
> If you can expand on where the gaps are with only having Spark/Hudi, that'd be really educative.
>
> Thanks
> Vinoth
>
> On Sat, Mar 16, 2019 at 11:17 PM Taher Koitawala <[email protected]> wrote:
>
> Hi Prasanna,
> Thank you for your reply. Should we start a discussion or open a JIRA in this regard then?
>
> On Sun, 17 Mar, 2019, 11:36 AM Prasanna <[email protected]> wrote:
>
> Hello,
>
> I don't know of any effort to write Hudi with Flink.
>
> - Prasanna
>
> On Sat, Mar 16, 2019 at 10:44 PM Taher Koitawala <[email protected]> wrote:
>
> Hey guys, any inputs on this?
>
> On Sat, 16 Mar, 2019, 12:35 PM Taher Koitawala <[email protected]> wrote:
>
> Hi guys, I have recently been exploring Hudi. It manages to solve a lot of our current use cases; however, my question is: can I use Flink with Hudi? So far I have only seen Spark integration with Hudi.
>
> Flink is more of a real-time processing engine than a near-real-time one, with rich features like checkpointing for fault tolerance, state for in-stream computations, better windowing capabilities, very high stream throughput, and exactly-once semantics from source to sink. Flink is capable of being a part of Hudi to solve our in-stream use cases.
>
> Regards,
> Taher Koitawala
> GS Lab Pune
> +91 8407979163
