Hi Nick,

Thank you for your more detailed thoughts. I fully agree with your points
about HudiLink, which should also be part of the long-term planning of the
Hudi ecosystem.


But I found that our thinking starts from different angles. I pay more
attention to the rationality of the existing architecture and whether the
dependency on the computing engine is pluggable. Don't get me wrong: I know
very well that although we have different perspectives, both views have
value for Hudi.

Let me give more details on the discussions I raised earlier.

Currently, multiple submodules of the Hudi project are tightly coupled to
Spark's design and dependencies. You can see that many of the class files
contain statements such as "import org.apache.spark.xxx".

I first put forward a discussion: "Integrate Hudi with Apache Flink", and
then came up with a discussion: "Decouple Hudi and Spark".

I think the word "Integrate" I used for the first discussion may not be
accurate enough. My intention is to make the computing engine used by Hudi
pluggable. To Hudi, Spark should just be a library; it is not the core of
Hudi, and it should not be strongly coupled with Hudi. The features
currently provided by Spark are also available from Flink. But in order to
achieve this, we need to decouple Hudi's code from its use of Spark.

This makes sense both in terms of architectural soundness and the community
ecosystem.
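To make the idea concrete, here is a rough sketch of what such a pluggable
abstraction could look like. The class and method names below are purely
hypothetical, not actual Hudi APIs: the point is only that client code would
depend on an engine-neutral interface, with Spark, Flink, or a local runner
plugged in behind it.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical engine-agnostic context; each engine module would supply its own implementation. */
interface EngineContext {
    <I, O> List<O> map(List<I> input, Function<I, O> fn);
}

/** Trivial local implementation, standing in for a Spark- or Flink-backed one. */
class LocalEngineContext implements EngineContext {
    @Override
    public <I, O> List<O> map(List<I> input, Function<I, O> fn) {
        return input.stream().map(fn).collect(Collectors.toList());
    }
}

public class EngineContextDemo {
    public static void main(String[] args) {
        // Client code depends only on EngineContext, never on Spark classes directly.
        EngineContext ctx = new LocalEngineContext();
        List<Integer> doubled = ctx.map(List.of(1, 2, 3), x -> x * 2);
        System.out.println(doubled); // [2, 4, 6]
    }
}
```

With this shape, "import org.apache.spark.xxx" statements would live only
inside the Spark-specific implementation module.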

Best,
Vino


Semantic Beeng <[email protected]> 于2019年8月4日周日 下午2:21写道:

> "+1 for both Beam and Flink" - what I propose implies this indeed.
>
> But/and am working from the desired functionality and a proposed design.
>
> (as opposed to starting with refactoring Hudi with the goal of close
> integration with Flink)
>
> I feel this is not necessary - but am not an expert in Hudi implementation.
>
> But am pretty sure it is not sufficient for the use cases I have in mind.
> The gist is using Hudi as a file based data lake + ML feature store that
> enables incremental analyses done with a combination of Flink, Beam, Spark,
> TensorFlow (see Petastorm from Uber Engineering for an idea).
>
> Let us call this HudiLink from now on (think of it as a mediator, not
> another Hudi).
>
> The intuition behind looking at more than Flink is that both Beam and
> Flink have good design abstractions we might reuse and extend.
>
> Like I said before, I do not believe in point-to-point integrations.
>
> Alternatively / in parallel, if you care to share your use cases, that
> would be very useful. Working with explicit use cases helps others to
> relate and help.
>
> Also, if some of you believe in (see) the value of refactoring the Hudi
> implementation for a hard integration with Flink (but have no time to argue
> for it), of course please go ahead.
>
> That may be a valid bottom up approach but I cannot relate to it myself
> (due to lack of use cases).
>
> I am working on material on HudiLink - if any of you are interested I
> might publish when it is more mature.
>
> Hint: this was part of the inspiration https://eng.uber.com/michelangelo/
>
> One well thought use case will get you "in". :-) Kidding, ofc.
>
> Cheers
>
> Nick
>
>
> On August 3, 2019 at 10:55 PM vino yang <[email protected]> wrote:
>
>
> +1 for both Beam and Flink
>
> First step here is to probably draw out the current hierarchy and figure out
> what the abstraction points are..
> In my opinion, the runtime (spark, flink) should be done at the
> hoodie-client level and just used by hoodie-utilities seamlessly..
>
>
> +1 for Vinoth's opinion, it should be the first step.
>
> No matter which computing framework we hope Hudi to integrate with,
> we need to decouple the Hudi client from Spark.
>
> We may need a pure client module named, for example,
> hoodie-client-core (common).
>
> Then we could have: hoodie-client-spark, hoodie-client-flink and
> hoodie-client-beam
>
> Suneel Marthi <[email protected]> 于2019年8月4日周日 上午10:45写道:
>
> +1 for Beam -- agree with Semantic Beeng's analysis.
>
> On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <[email protected]>
> wrote:
>
> So the way to go about this is to file a HIP, chalk all the classes out,
> and start moving towards a pure client.
>
> Secondly, do we want to try Beam?
>
> I think there is too much going on here and I'm not able to follow. If we
> want to try out Beam all along, I don't think it makes sense to do anything
> on Flink then.
>
> On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <[email protected]>
> wrote:
>
> >> +1 My money is on this approach.
> >>
> >> The existing abstractions from Beam seem enough for the use cases as I
> >> imagine them.
> >>
> >> Flink also has "dynamic table", "table source" and "table sink" which
> >> seem very useful abstractions where Hudi might fit nicely.
> >>
> >>
> >>
>
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> >>
> >>
> >> Attached a screen shot.
> >>
> >> This seems to fit with the original premise of Hudi as well.
> >>
> >> I am exploring this avenue with a use case that involves "temporal joins
> >> on streams", which I need for feature extraction.
> >>
> >> Anyone who is interested in this or has concrete enough needs and use
> >> cases, please let me know.
> >>
> >> Best to go from an agreed upon set of 2-3 use cases.
> >>
> >> Cheers
> >>
> >> Nick
> >>
> >>
> >> > Also, we do have some Beam experts on the mailing list.. Can you
> >> > please weigh in on the viability of using Beam as the intermediate
> >> > abstraction here between Spark/Flink?
> >> > Hudi uses RDD APIs like groupBy, mapToPair, sortAndRepartition,
> >> > reduceByKey, countByKey, and also does a lot of custom partitioning.
> >>
> >> >
> >>
> >
>
>
