DeltaStreamer is just a tool that helps us ingest from Kafka/DFS easily into
Hudi. I don't think it's of any conceptual importance for this discussion.

I am not sure I understand your statements about commit = data window and
streaming.

>>Now, on this basis, I can imagine an Arrow-based virtual dataset cache
built (in parallel?) with DFS writing business as usual, which would be
built from a stream of such data window descriptors.
Yes, we could double-write to DFS and an Arrow cache, if that's what you
mean. But how to provide transactionality across both writes would be an
open item.

I am planning to spend time on the wiki page today or tomorrow. Will leave
inline comments there and bump this thread.



On Tue, Mar 19, 2019 at 2:29 PM Semantic Beeng <[email protected]>
wrote:

> @vc - mulling on this.
>
> Indeed a key question is where Arrow fits: after the usual persistence is
> done to Hudi dataset or in parallel to it.
>
> Part of my doubt (ignorance?) centers on the DeltaStreamer - is that
> even important here?
>
> I built a file based data lake and converted raw files to parquet Hudi
> datasets without streaming the data itself at all.
>
> Once in Hudi datasets, each commit becomes a physical "data window" - we
> used a physical id to distinguish them.
>
> Now what can be "streamed" in the subsequent analyses is a descriptor of
> these physical "data windows".
>
> The windows themselves can be very large - why stream the data when a
> list of descriptors is enough, given that they point to the actual data?
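>
> As a minimal sketch of this (Python, assuming the standard .hoodie
> metadata layout; the descriptor fields are illustrative), commits can be
> turned into lightweight, streamable descriptors:
>
>     import os
>
>     def window_descriptors(base_path):
>         """Yield one small descriptor per Hudi commit (one "data window")."""
>         meta_dir = os.path.join(base_path, ".hoodie")
>         for name in sorted(os.listdir(meta_dir)):
>             if name.endswith(".commit"):
>                 # The commit instant doubles as the physical window id.
>                 yield {"dataset": base_path, "instant": name[:-len(".commit")]}
>
> Downstream analyses consume these tiny descriptors and open the underlying
> files only when they actually need the data.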
>
> It would be great if you could validate this thinking, so I know whether
> I am missing some insight about the DeltaStreamer.
>
> I actually have a problem with streaming the data itself because
>
> 1. it creates duplication
>
> 2. the approach suggests the data has to be streamed over and over to
> various analyses in the entire analysis ecosystem
>
> Instead, the approach I suggest above is compatible with the Dremio
> "virtual dataset" (point, don't copy) mindset, as captured in
> https://cwiki.apache.org/confluence/display/HUDI/Hudi+for+Continuous+Deep+Analytics
>
> Now, on this basis, I can imagine an Arrow-based virtual dataset cache
> built (in parallel?) with DFS writing business as usual, which would be
> built from a stream of such data window descriptors.
>
> This sounds awfully like what the Dremio presentation linked in the wiki
> seems to suggest when they talk about this "refresher".
>
> Would you please refer to the presentation for context? :-)
>
> This is critical to my idea / vision of how Hudi can be at the foundation
> of a file-based data ecosystem and not just a tool that copies data around
> when told, without a context of "why".
>
> Anyway, this is the best I can do over email. :-(
>
> Please let me know.
>
> Cheers
>
> Nick
>
> On March 19, 2019 at 1:56 PM Vinoth Chandar <[email protected]> wrote:
>
>
> On Beam, I tested it over a year ago, and Spark batch/Flink streaming under
> Beam worked well IMO. Our mentors both know this area very well. Maybe
> they can chime in as well? :)
>
> I think I understand Arrow fairly well. It can read Parquet data into
> memory efficiently using zero-copy techniques, and then lay it out
> optimized for vectorized queries. Even if you read/write Parquet on disk,
> the bottleneck is often CPU and not really I/O. Arrow helps with all of
> this. We have thought about building a mutable cache using Arrow before,
> to improve real-time view performance.
>
> Something like:
> Source => Hudi DeltaStreamer => Hudi Dataset on DFS => Incremental Pull =>
> Arrow cache
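>
> As a rough sketch of the last two hops (PySpark; the Hudi datasource
> option keys have changed names across releases, so treat them as
> indicative, and the path/instant are placeholders):
>
>     import pyarrow as pa
>     from pyspark.sql import SparkSession
>
>     spark = SparkSession.builder.appName("arrow-cache").getOrCreate()
>
>     # Incremental pull: only records committed after the given instant.
>     updates = (spark.read.format("org.apache.hudi")
>                .option("hoodie.datasource.query.type", "incremental")
>                .option("hoodie.datasource.read.begin.instanttime", "20190318000000")
>                .load("/data/hudi/my_table"))
>
>     # Feed the delta into an Arrow-backed cache.
>     arrow_delta = pa.Table.from_pandas(updates.toPandas())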
>
> Hudi works with a DFS abstraction (HDFS, S3, IGFS, etc.) and manages upserts
> on them. Are you suggesting Hudi work with Arrow buffers directly?
>
>
> Thanks
> Vinoth
>
> On Mon, Mar 18, 2019 at 1:35 PM Semantic Beeng <[email protected]>
> wrote:
>
> @vc I also see Beam working well as a common denominator between Spark and
> Flink.
>
> Is there any thorough coverage of how Beam works with Spark and
> Flink? We could use that as a source of usage examples.
>
> I know of Scio and Featran but not well enough.
>
> https://github.com/spotify/scio
>
> https://github.com/spotify/scio/tree/master/scio-examples/src/main/scala/com/spotify/scio/examples/cookbook
>
> @vc, an idea triggered by your point 1 below.
>
> If one (me) is interested in composing "analyses" closely (without writing
> to disk or Kafka), would it make sense to "write to Parquet" in memory
> through Apache Arrow?
>
> https://arrow.apache.org/docs/python/parquet.html
>
> "Apache Arrow is an ideal in-memory transport layer for data that is being
> read or written with Parquet files."
>
> https://arrow.apache.org/docs/format/IPC.html
>
> Is this possibly a good idea?
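>
> For instance, a minimal pyarrow sketch of that in-memory round trip
> (illustrative data; nothing touches disk):
>
>     import pyarrow as pa
>     import pyarrow.parquet as pq
>
>     table = pa.Table.from_pydict({"id": [1, 2, 3], "score": [0.1, 0.2, 0.3]})
>
>     # "Write to Parquet" into an Arrow buffer instead of a file.
>     sink = pa.BufferOutputStream()
>     pq.write_table(table, sink)
>
>     # A downstream analysis reads the same buffer back as a table.
>     roundtrip = pq.read_table(pa.BufferReader(sink.getvalue()))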
>
> I can see how this might not make sense, because Hudi maintains a
> sophisticated handle on the actual files. (mulling)
>
> But maybe there is a way to "stream" the updates also over Arrow somehow
> (maybe we can have some abstractions over how updates are "streamed" in, or
> relate this to the DeltaStreamer somehow).
>
> If you think it has value, then I will elaborate a bit more as a use case
> in the wiki.
>
> The nice thing is that Arrow gives a shared runtime: you can have f1 in the
> JVM writing and f2 in CPython reading while sharing the same off-heap/native
> memory.
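>
> As a sketch of the CPython half of that handoff (Arrow IPC stream format;
> a JVM writer could produce the identical stream, and the path is
> illustrative):
>
>     import pyarrow as pa
>
>     table = pa.Table.from_pydict({"instant": ["20190318"], "rows": [1000]})
>
>     # f1 (here Python, but equally a JVM process) writes the IPC stream format.
>     with pa.OSFile("/tmp/windows.arrow", "wb") as f:
>         with pa.ipc.new_stream(f, table.schema) as writer:
>             writer.write_table(table)
>
>     # f2 memory-maps the same bytes; record batches come back zero-copy.
>     with pa.memory_map("/tmp/windows.arrow", "rb") as source:
>         shared = pa.ipc.open_stream(source).read_all()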
>
> Thoughts?
>
> Any sense of what it would take to abstract at this level in Hudi?
>
> Cheers
>
> Nick
>
> On March 18, 2019 at 3:45 PM Vinoth Chandar <[email protected]> wrote:
>
> Sorry for the late reply. Busy Sunday :)
>
> First off, this is a very interesting topic.
>
> @Semantic Beeng <[email protected]>, +1. It would be good to lay out
> the current use cases not met by Spark execution, specifically related to
> Hudi's write path.
>
> @taher, Flink definitely has its advantages, as you mention.
> Two cases where I thought direct Flink support for writing datasets would
> be good are:
>
> 1) Capture the results of Flink jobs and write them out as a Hudi dataset.
> But one can always write a Flink job to compute the results and then store
> them in Hudi, using it as a sink?
> Something like: Kafka => Flink => Kafka => DeltaStreamer => Hudi on DFS
> 2) If someone is not using Spark at all, then Hudi brings it in and
> potentially increases ops costs?
>
> On Beam, while it's admittedly new, if we were to abstract away Spark and
> Flink from Hudi code (once again, a non-trivial amount of work ;)), then
> Beam is very attractive.
> We will end up inventing a Beam-- otherwise anyway :)
>
> Thanks
> Vinoth
>
> On Sun, Mar 17, 2019 at 10:22 PM Semantic Beeng <[email protected]>
> wrote:
>
> Hello Taher,
>
> Vinoth, I was looking for such an assessment from you - thanks. :-)
>
> Taher - at a high level it sounds interesting to explore some Flink-specific
> or common use cases, I think.
>
> We are having discussions about integrating Hudi with Beam, so if you can
> relate your use cases to Flink + Beam, it would be interesting.
>
> I can imagine useful scenarios where Spark-based analytics would be
> combined with Flink-based analytics going through Parquet.
>
> But I do not know Flink well enough to see where Hudi-like functionality
> would fit.
>
> Could you provide such use cases (all kinds above and others)? Ideally
> with code references.
>
> Depending on how serious your interest is, we can go deeper in the wiki.
>
> Thanks
>
> Nick
>
> On March 17, 2019 at 3:34 AM Vinoth Chandar <[email protected]> wrote:
>
> Hi Taher,
>
> Thanks for kicking off this thread. We can use this thread itself to
> discuss Flink. Hudi uses Spark today on the writing side, and the
> micro-batch model actually fits very well. Given cloud stores don't support
> appends anyway, we would end up micro-batching nonetheless, even with
> Flink. Abstracting out Spark would be a large effort (which gets me
> thinking: should we then just rewrite on top of Beam? ;)) and I have not
> thought of any unique advantages we get for Hudi by adding Flink. Do you
> have something in mind?
>
> If you can expand on where the gaps are with only having Spark/Hudi,
> that'd be really educative.
>
> Thanks
> Vinoth
>
> On Sat, Mar 16, 2019 at 11:17 PM Taher Koitawala <[email protected]>
> wrote:
>
> Hi Prasanna,
> Thank you for your reply. Should we start a discussion or open a JIRA
> in this regard then?
>
> On Sun, 17 Mar, 2019, 11:36 AM Prasanna, <[email protected]> wrote:
>
> Hello,
>
> I don't know of any effort to write to Hudi with Flink.
>
> - Prasanna
>
> On Sat, Mar 16, 2019 at 10:44 PM Taher Koitawala <[email protected]>
> wrote:
>
> Hey Guys, Any inputs on this?
>
> On Sat, 16 Mar, 2019, 12:35 PM Taher Koitawala, <[email protected]>
> wrote:
>
> Hi Guys, I have recently been exploring Hudi. It manages to solve a lot of
> our current use cases; however, my question is: can I use Flink with Hudi?
> So far I have only seen Spark integration with Hudi.
>
> Flink is more of a real-time processing engine than a near-real-time one,
> and with its rich features - checkpointing for fault tolerance, state for
> in-stream computations, better windowing capabilities, very high stream
> throughput, and exactly-once semantics from source to sink - Flink is
> capable of being a part of Hudi to solve our in-stream use cases.
>
> Regards,
> Taher Koitawala
> GS Lab Pune
> +91 8407979163
>
