Let's see if others have any thoughts as well. We can plan to finalize the
approach by EOW.

On Mon, Aug 5, 2019 at 7:06 PM vino yang <yanghua1...@gmail.com> wrote:

> Hi guys,
>
> Also, +1 for Approach 1 like Taher.
>
> > If we can do a comprehensive analysis of this model and come up with
> > means to refactor this cleanly, this would be promising.
>
> Yes, once we reach a conclusion, we can start this work.
>
> Best,
> Vino
>
>
> On Tue, Aug 6, 2019 at 12:28 AM taher koitawala <taher...@gmail.com> wrote:
>
> > +1 for Approach 1: point integration with each framework.
> >
> > Approach 2 has a problem, as you said: "Developers need to think about
> > what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end, this
> > may not be the panacea that it seems to be."
> >
> > We have seen various pipelines in the Beam DAG being expressed
> > differently than we had them in our original use case. Also, switching
> > between the Spark and Flink runners in Beam has varying impacts on the
> > pipelines; for example, some features available in Flink are not
> > available on the Spark runner. Refer to this capability matrix ->
> > https://beam.apache.org/documentation/runners/capability-matrix/
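> >
> > To make the runner point concrete, here is a minimal sketch (the class
> > name and setup are illustrative only, not Hudi code) of how Beam selects
> > the engine purely through options, which is exactly why per-runner
> > feature gaps only surface at run time:
> >
> >   import org.apache.beam.runners.flink.FlinkRunner;
> >   import org.apache.beam.sdk.Pipeline;
> >   import org.apache.beam.sdk.options.PipelineOptions;
> >   import org.apache.beam.sdk.options.PipelineOptionsFactory;
> >
> >   public class RunnerSwitchSketch {
> >     public static void main(String[] args) {
> >       PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
> >       // Swapping to SparkRunner.class is the only code change needed,
> >       // but transform support then depends on what the Spark runner
> >       // implements (see the capability matrix above).
> >       options.setRunner(FlinkRunner.class);
> >       Pipeline p = Pipeline.create(options);
> >       // ... same transforms would be applied here for either engine ...
> >       p.run().waitUntilFinish();
> >     }
> >   }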
> >
> > Hence my vote is for Approach 1: let's decouple and build the
> > abstraction for each framework. That is a much better option. We will
> > also have more control over each framework's implementation.
> >
> > On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar <vin...@apache.org> wrote:
> >
> > > Would like to highlight that there are two distinct approaches here
> > > with different tradeoffs. Think of this as my braindump, as I have
> > > been thinking about this quite a bit in the past.
> > >
> > >
> > > *Approach 1: Point integration with each framework*
> > >
> > > >> We may need a pure client module named, for example,
> > > >> hoodie-client-core (common).
> > > >> Then we could have: hoodie-client-spark, hoodie-client-flink,
> > > >> and hoodie-client-beam.
> > >
> > > (+) This is the safest option IMO, since we can isolate the current
> > > Spark execution (hoodie-spark, hoodie-client-spark) from the changes
> > > for Flink while it stabilizes over a few releases.
> > > (-) Downside is that the utilities need to be redone:
> > > hoodie-utilities-spark, hoodie-utilities-flink, and
> > > hoodie-utilities-core? hoodie-cli?
> > >
> > > If we can do a comprehensive analysis of this model and come up with
> > > means to refactor this cleanly, this would be promising.
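> > >
> > > To make this concrete, here is a minimal sketch of the kind of
> > > engine-agnostic contract hoodie-client-core could expose. All names
> > > here are hypothetical, for discussion only, not existing Hudi APIs:
> > >
> > >   import java.util.List;
> > >
> > >   // Hypothetical contract defined in hoodie-client-core; the
> > >   // hoodie-client-spark and hoodie-client-flink modules would each
> > >   // supply an engine-specific implementation.
> > >   public interface EngineAgnosticWriteClient<RECORD, STATUS> {
> > >     // Upsert a batch of records as of the given instant, using the
> > >     // underlying engine (RDDs for Spark, DataStreams for Flink).
> > >     List<STATUS> upsert(List<RECORD> records, String instantTime);
> > >   }
> > >
> > > hoodie-utilities-core would then program against this interface and
> > > stay unaware of which engine module is on the classpath.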
> > >
> > >
> > > *Approach 2: Beam as the compute abstraction*
> > >
> > > Another, more drastic, approach is to remove Spark as the compute
> > > abstraction for writing data and replace it with Beam.
> > >
> > > (+) All of the code remains more or less similar and there is one
> > > compute API to reason about.
> > >
> > > (-) The (very big) assumption here is that we are able to tune the
> > > Spark runtime the same way using Beam: custom partitioners, support
> > > for all the RDD operations we invoke, caching, etc.
> > > (-) It will be a massive rewrite, and testing such a large rewrite
> > > would also be really challenging, since we need to pay attention to
> > > all the intricate details to ensure that today's Spark users
> > > experience no regressions/side-effects.
> > > (-) Note that we would still probably need to support the
> > > hoodie-spark module, and maybe a first-class integration with Flink
> > > as well, for native Flink/Spark pipeline authoring. Users of, say,
> > > DeltaStreamer need to pass in Spark or Flink configs anyway..
> > > Developers need to think about
> > > what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end,
> > > this may not be the panacea that it seems to be.
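> > >
> > > To illustrate the tuning gap in the first con, compare a Spark
> > > pattern Hudi relies on today with its nearest Beam analogue (a
> > > sketch with String payloads standing in for Hudi record types, not
> > > actual Hudi code):
> > >
> > >   import org.apache.beam.sdk.transforms.GroupByKey;
> > >   import org.apache.beam.sdk.values.KV;
> > >   import org.apache.beam.sdk.values.PCollection;
> > >   import org.apache.spark.HashPartitioner;
> > >   import org.apache.spark.api.java.JavaPairRDD;
> > >   import org.apache.spark.storage.StorageLevel;
> > >
> > >   public class PartitioningGapSketch {
> > >     // Spark: explicit control over data placement and caching.
> > >     static JavaPairRDD<String, String> sparkSide(
> > >         JavaPairRDD<String, String> tagged, int buckets) {
> > >       return tagged
> > >           .partitionBy(new HashPartitioner(buckets))    // custom partitioner
> > >           .persist(StorageLevel.MEMORY_AND_DISK_SER()); // explicit caching
> > >     }
> > >
> > >     // Beam: grouping is by key only; partition placement and
> > >     // persistence are decided by the runner, with no public knob.
> > >     static PCollection<KV<String, Iterable<String>>> beamSide(
> > >         PCollection<KV<String, String>> kvs) {
> > >       return kvs.apply(GroupByKey.<String, String>create());
> > >     }
> > >   }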
> > >
> > >
> > >
> > > One goal for the HIP is to get us all to agree as a community on
> > > which one to pick, with sufficient investigation, testing, and
> > > benchmarking..
> > >
> > > On Sat, Aug 3, 2019 at 7:56 PM vino yang <yanghua1...@gmail.com> wrote:
> > >
> > > > +1 for both Beam and Flink
> > > >
> > > > > First step here is probably to draw out the current hierarchy and
> > > > > figure out what the abstraction points are..
> > > > > In my opinion, the runtime (Spark, Flink) should be done at the
> > > > > hoodie-client level and just used by hoodie-utilities seamlessly..
> > > >
> > > > +1 for Vinoth's opinion; that should be the first step.
> > > >
> > > > No matter which computing framework we hope Hudi will integrate
> > > > with, we need to decouple the Hudi client from Spark.
> > > >
> > > > We may need a pure client module named, for example,
> > > > hoodie-client-core (common).
> > > >
> > > > Then we could have: hoodie-client-spark, hoodie-client-flink, and
> > > > hoodie-client-beam.
> > > >
> > > > On Sun, Aug 4, 2019 at 10:45 AM Suneel Marthi <smar...@apache.org> wrote:
> > > >
> > > > > +1 for Beam -- agree with Semantic Beeng's analysis.
> > > > >
> > > > > On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <taher...@gmail.com> wrote:
> > > > >
> > > > > > So the way to go about this is to file a HIP, chalk all the
> > > > > > classes out, and start moving towards a pure client.
> > > > > >
> > > > > > Secondly, do we want to try Beam?
> > > > > >
> > > > > > I think there is too much going on here and I'm not able to
> > > > > > follow. If we want to try out Beam all along, I don't think it
> > > > > > makes sense to do anything on Flink then.
> > > > > >
> > > > > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <n...@semanticbeeng.com> wrote:
> > > > > >
> > > > > >> +1 My money is on this approach.
> > > > > >>
> > > > > >> The existing abstractions from Beam seem enough for the use
> > > > > >> cases as I imagine them.
> > > > > >>
> > > > > >> Flink also has "dynamic table", "table source", and "table
> > > > > >> sink", which seem very useful abstractions where Hudi might
> > > > > >> fit nicely.
> > > > > >>
> > > > > >> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> > > > > >>
> > > > > >> Attached a screen shot.
> > > > > >>
> > > > > >> This seems to fit with the original premise of Hudi as well.
> > > > > >>
> > > > > >> Am exploring this avenue with a use case that involves
> > > > > >> "temporal joins on streams", which I need for feature
> > > > > >> extraction.
> > > > > >>
> > > > > >> Anyone who is interested in this or has concrete enough needs
> > > > > >> and use cases, please let me know.
> > > > > >>
> > > > > >> Best to go from an agreed-upon set of 2-3 use cases.
> > > > > >>
> > > > > >> Cheers
> > > > > >>
> > > > > >> Nick
> > > > > >>
> > > > > >>
> > > > > >> > Also, we do have some Beam experts on the mailing list.. Can
> > > > > >> > you please weigh in on the viability of using Beam as the
> > > > > >> > intermediate abstraction here between Spark/Flink? Hudi uses
> > > > > >> > RDD APIs like groupBy, mapToPair, sortAndRepartition,
> > > > > >> > reduceByKey, countByKey, and also does custom partitioning a
> > > > > >> > lot.
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>
