I'd like to highlight that there are two distinct approaches here, with
different tradeoffs. Think of this as my braindump; I have been thinking
about this quite a bit.


*Approach 1: Point integration with each framework*

>> We may need a pure client module named, for example,
>> hoodie-client-core (common)
>> Then we could have: hoodie-client-spark, hoodie-client-flink,
>> and hoodie-client-beam

(+) This is the safest to do IMO, since we can isolate the current Spark
execution (hoodie-spark, hoodie-client-spark) from the changes for Flink
while it stabilizes over a few releases.
(-) Downside is that the utilities need to be redone as well:
hoodie-utilities-spark, hoodie-utilities-flink, and hoodie-utilities-core?
What about hoodie-cli?

If we can do a comprehensive analysis of this model and come up with a means
to refactor it cleanly, this would be promising.
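
To make this concrete, here is a minimal sketch of what the seam inside
hoodie-client-core could look like. All names below (HoodieData,
HoodieEngineContext, SerializableFunction) are hypothetical, purely to
illustrate the shape of the abstraction, not proposed APIs:

    import java.io.Serializable;
    import java.util.List;
    import java.util.function.Function;

    // Functions shipped to executors/task managers must be serializable.
    interface SerializableFunction<I, O> extends Function<I, O>, Serializable {}

    // Engine-neutral handle over a distributed collection of records.
    // hoodie-client-spark could back this with JavaRDD<T>;
    // hoodie-client-flink with a DataSet<T>/DataStream<T>.
    interface HoodieData<T> {
      <R> HoodieData<R> map(SerializableFunction<T, R> fn);
      List<T> collectAsList();
    }

    // Engine-neutral entry point handed to the write client, so that
    // hoodie-utilities-core (and hoodie-cli) can stay engine-agnostic.
    interface HoodieEngineContext {
      <T> HoodieData<T> parallelize(List<T> items, int parallelism);
    }

The write client and utilities would then program only against these
interfaces, and each hoodie-client-* module stays a thin adapter.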


*Approach 2: Beam as the compute abstraction*

Another more drastic approach is to remove Spark as the compute abstraction
for writing data and replace it with Beam.

(+) All of the code remains more or less the same, and there is a single
compute API to reason about.

(-) The (very big) assumption here is that we are able to tune the Spark
runtime the same way through Beam: custom partitioners, support for all RDD
operations we invoke, caching, etc. (see the sketch after this list).
(-) It would be a massive rewrite, and testing a rewrite of that size would
also be really challenging, since we would need to pay attention to every
intricate detail to ensure that today's Spark users experience no
regressions/side-effects.
(-) Note that we would still probably need to support the hoodie-spark
module, and maybe a similar first-class integration with Flink, for native
Flink/Spark pipeline authoring. Users of, say, DeltaStreamer need to pass in
Spark or Flink configs anyway, and developers would still need to think about
what-if-this-piece-of-code-ran-as-spark-vs-flink. So in the end, this may
not be the panacea it seems to be.
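
To ground the first concern above, here is a minimal sketch (Beam Java SDK,
with illustrative data and class name) of how an RDD-style reduceByKey maps
onto a per-key combine. Note there is no direct Beam counterpart for Spark's
custom partitioners or explicit persist/cache; partitioning and
materialization are left to the runner, which is exactly the tuning gap
flagged above:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.coders.VarLongCoder;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class BeamReduceByKeySketch {
      public static void main(String[] args) {
        // The runner is chosen at launch time via pipeline options,
        // e.g. --runner=SparkRunner or --runner=FlinkRunner.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // reduceByKey((a, b) -> a + b) on (fileId, recordCount) pairs
        // becomes a per-key combine in Beam.
        PCollection<KV<String, Long>> countsPerFile =
            p.apply(Create.of(
                        KV.of("file-0001", 1L),
                        KV.of("file-0001", 1L),
                        KV.of("file-0002", 1L))
                    .withCoder(KvCoder.of(StringUtf8Coder.of(), VarLongCoder.of())))
             .apply(Sum.longsPerKey());

        p.run().waitUntilFinish();
      }
    }

This is where the single-compute-API appeal comes from, but it also shows the
gap: nothing in this code lets us pin down partitioning the way a custom
Partitioner does on an RDD.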



One goal for the HIP is to get us all to agree, as a community, on which one
to pick, backed by sufficient investigation, testing, and benchmarking.

On Sat, Aug 3, 2019 at 7:56 PM vino yang <yanghua1...@gmail.com> wrote:

> +1 for both Beam and Flink
>
> > First step here is probably to draw out the current hierarchy and figure
> > out what the abstraction points are..
> > In my opinion, the runtime (spark, flink) should be done at the
> > hoodie-client level and just used by hoodie-utilities seamlessly..
>
> +1 for Vinoth's opinion, it should be the first step.
>
> No matter which computing framework we hope Hudi will integrate with,
> we need to decouple the Hudi client from Spark.
>
> We may need a pure client module named, for example,
> hoodie-client-core (common)
>
> Then we could have: hoodie-client-spark, hoodie-client-flink, and
> hoodie-client-beam
>
> Suneel Marthi <smar...@apache.org> wrote on Sun, Aug 4, 2019 at 10:45 AM:
>
> > +1 for Beam -- agree with Semantic Beeng's analysis.
> >
> > On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <taher...@gmail.com>
> > wrote:
> >
> > > So the way to go about this is to file a HIP, chalk out all the
> > > classes, and start moving towards a pure client.
> > >
> > > Secondly, do we want to try Beam?
> > >
> > > I think there is too much going on here and I'm not able to follow. If
> > > we want to try out Beam all along, I don't think it makes sense to do
> > > anything on Flink then.
> > >
> > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <n...@semanticbeeng.com>
> > > wrote:
> > >
> > >> +1 My money is on this approach.
> > >>
> > >> The existing abstractions from Beam seem enough for the use cases as I
> > >> imagine them.
> > >>
> > >> Flink also has "dynamic table", "table source" and "table sink" which
> > >> seem very useful abstractions where Hudi might fit nicely.
> > >>
> > >>
> > >>
> > >> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> > >>
> > >>
> > >> Attached a screenshot.
> > >>
> > >> This seems to fit with the original premise of Hudi as well.
> > >>
> > >> Am exploring this avenue with a use case that involves "temporal joins
> > >> on streams" which I need for feature extraction.
> > >>
> > >> If anyone is interested in this or has concrete enough needs and use
> > >> cases, please let me know.
> > >>
> > >> Best to go from an agreed upon set of 2-3 use cases.
> > >>
> > >> Cheers
> > >>
> > >> Nick
> > >>
> > >>
> > >> > Also, we do have some Beam experts on the mailing list.. Can you
> > >> > please weigh in on the viability of using Beam as the intermediate
> > >> > abstraction here between Spark/Flink?
> > >> > Hudi uses RDD APIs like groupBy, mapToPair, sortAndRepartition,
> > >> > reduceByKey, countByKey, and also does custom partitioning a lot.
> > >>
