Hi Vinoth,
        Are there some tasks I can take up to ramp up on the code? I want to
get more familiar with the code and understand the existing implementation
better.

Thanks,
Taher Koitawala

On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar <vin...@apache.org> wrote:

> Let's see if others have any thoughts as well. We can plan to finalize the
> approach by EOW.
>
> On Mon, Aug 5, 2019 at 7:06 PM vino yang <yanghua1...@gmail.com> wrote:
>
> > Hi guys,
> >
> > Also, +1 for Approach 1 like Taher.
> >
> > > If we can do a comprehensive analysis of this model and come up with
> > > means to refactor this cleanly, this would be promising.
> >
> > Yes, once we reach a conclusion, we can start this work.
> >
> > Best,
> > Vino
> >
> >
> > taher koitawala <taher...@gmail.com> wrote on Tue, Aug 6, 2019 at 12:28 AM:
> >
> > > +1 for Approach 1: point integration with each framework.
> > >
> > > Approach 2 has a problem, as you said: "Developers need to think about
> > > what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end, this
> > > may not be the panacea that it seems to be."
> > >
> > > We have seen various pipelines in the Beam DAG being expressed
> > > differently than we had them in our original use case. Also, switching
> > > between the Spark and Flink runners in Beam has various impacts on the
> > > pipelines; for example, some features available in Flink are not
> > > available on the Spark runner. Refer to this capability matrix ->
> > > https://beam.apache.org/documentation/runners/capability-matrix/
> > >
> > > Hence my vote is for Approach 1: let's decouple and build the
> > > abstraction for each framework. That is a much better option. We will
> > > also have more control over each framework's implementation.
> > >
> > > On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar <vin...@apache.org> wrote:
> > >
> > > > I would like to highlight that there are two distinct approaches here
> > > > with different tradeoffs. Think of this as my braindump, as I have
> > > > been thinking about this quite a bit in the past.
> > > >
> > > >
> > > > *Approach 1: Point integration with each framework*
> > > >
> > > > >> We may need a pure client module named, for example,
> > > > >> hoodie-client-core (common)
> > > > >> Then we could have: hoodie-client-spark, hoodie-client-flink
> > > > >> and hoodie-client-beam
> > > >
> > > > (+) This is the safest to do IMO, since we can isolate the current
> > > > Spark execution (hoodie-spark, hoodie-client-spark) from the changes
> > > > for Flink, while it stabilizes over a few releases.
> > > > (-) Downside is that the utilities need to be redone:
> > > > hoodie-utilities-spark, hoodie-utilities-flink and
> > > > hoodie-utilities-core? hoodie-cli?
> > > >
> > > > If we can do a comprehensive analysis of this model and come up with
> > > > means to refactor this cleanly, this would be promising.
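> > > >
> > > > To make this a bit more concrete, here is a rough, purely illustrative
> > > > sketch of what hoodie-client-core could expose and how
> > > > hoodie-client-spark might implement it. The names below
> > > > (HoodieEngineContext, SerializableFunction, SparkEngineContext) are
> > > > hypothetical, not existing Hudi classes:
> > > >
> > > > import java.io.Serializable;
> > > > import java.util.List;
> > > > import org.apache.spark.api.java.JavaSparkContext;
> > > >
> > > > // (in hoodie-client-core) engine-agnostic function type the write client codes against
> > > > public interface SerializableFunction<I, O> extends Serializable {
> > > >   O apply(I input) throws Exception;
> > > > }
> > > >
> > > > // (in hoodie-client-core) abstraction each engine module would implement;
> > > > // groupBy/reduceByKey-style primitives would be added along the same lines
> > > > public interface HoodieEngineContext {
> > > >   <I, O> List<O> map(List<I> input, SerializableFunction<I, O> fn, int parallelism);
> > > > }
> > > >
> > > > // (in hoodie-client-spark) Spark-backed implementation
> > > > public class SparkEngineContext implements HoodieEngineContext {
> > > >   private final JavaSparkContext jsc;
> > > >
> > > >   public SparkEngineContext(JavaSparkContext jsc) { this.jsc = jsc; }
> > > >
> > > >   @Override
> > > >   public <I, O> List<O> map(List<I> input, SerializableFunction<I, O> fn, int parallelism) {
> > > >     // distribute the input, apply fn on the executors, bring results back
> > > >     return jsc.parallelize(input, parallelism).map(fn::apply).collect();
> > > >   }
> > > > }
> > > >
> > > > A hoodie-client-flink module would then implement the same interface on
> > > > top of Flink's APIs, and the write client would only ever see
> > > > HoodieEngineContext.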
> > > >
> > > >
> > > > *Approach 2: Beam as the compute abstraction*
> > > >
> > > > Another more drastic approach is to remove Spark as the compute
> > > abstraction
> > > > for writing data and replace it with Beam.
> > > >
> > > > (+) All of the code remains more or less similar and there is one
> > compute
> > > > API to reason about.
> > > >
> > > > (-) The (very big) assumption here is that we are able to tune the
> > > > Spark runtime the same way using Beam: custom partitioners, support
> > > > for all the RDD operations we invoke, caching, etc. (see the sketch
> > > > after this list for the kind of usage that would need Beam
> > > > equivalents).
> > > > (-) It will be a massive rewrite, and testing such a large rewrite
> > > > would also be really challenging, since we need to pay attention to
> > > > all the intricate details to ensure that today's Spark users
> > > > experience no regressions/side-effects.
> > > > (-) Note that we probably still need to support the hoodie-spark
> > > > module, and maybe a similar first-class integration with Flink, for
> > > > native Flink/Spark pipeline authoring. Users of, say, DeltaStreamer
> > > > need to pass in Spark or Flink configs anyway.. Developers need to
> > > > think about what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in
> > > > the end, this may not be the panacea that it seems to be.
> > > >
> > > >
> > > >
> > > > One goal for the HIP is to get us all to agree as a community on which
> > > > one to pick, with sufficient investigation, testing, benchmarking..
> > > >
> > > > On Sat, Aug 3, 2019 at 7:56 PM vino yang <yanghua1...@gmail.com>
> > > > wrote:
> > > >
> > > > > +1 for both Beam and Flink
> > > > >
> > > > > > First step here is to probably draw out the current hierarchy and
> > > > > > figure out what the abstraction points are..
> > > > > > In my opinion, the runtime (Spark, Flink) should be done at the
> > > > > > hoodie-client level and just used by hoodie-utilities seamlessly..
> > > > >
> > > > > +1 for Vinoth's opinion, it should be the first step.
> > > > >
> > > > > No matter which computing framework we hope Hudi to integrate with,
> > > > > we need to decouple the Hudi client from Spark.
> > > > >
> > > > > We may need a pure client module named, for example,
> > > > > hoodie-client-core (common).
> > > > >
> > > > > Then we could have: hoodie-client-spark, hoodie-client-flink and
> > > > > hoodie-client-beam
> > > > >
> > > > > Suneel Marthi <smar...@apache.org> wrote on Sun, Aug 4, 2019 at 10:45 AM:
> > > > >
> > > > > > +1 for Beam -- agree with Semantic Beeng's analysis.
> > > > > >
> > > > > > On Sat, Aug 3, 2019 at 10:30 PM taher koitawala
> > > > > > <taher...@gmail.com> wrote:
> > > > > >
> > > > > > > So the way to go around this is to file a HIP, chalk all the
> > > > > > > classes out, and start moving towards a pure client.
> > > > > > >
> > > > > > > Secondly, should we want to try Beam?
> > > > > > >
> > > > > > > I think there is too much going on here and I'm not able to
> > > > > > > follow. If we want to try out Beam all along, I don't think it
> > > > > > > makes sense to do anything on Flink then.
> > > > > > >
> > > > > > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng
> > > > > > > <n...@semanticbeeng.com> wrote:
> > > > > > >
> > > > > > >> +1 My money is on this approach.
> > > > > > >>
> > > > > > >> The existing abstractions from Beam seem enough for the use
> > > > > > >> cases as I imagine them.
> > > > > > >>
> > > > > > >> Flink also has "dynamic table", "table source" and "table sink"
> > > > > > >> which seem very useful abstractions where Hudi might fit nicely.
> > > > > > >>
> > > > > > >>
> > > > > > >> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> > > > > > >>
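> > > > > > >> Just to make the idea tangible, here is a rough sketch of what
> > > > > > >> writing into Hudi through a Flink table sink could look like,
> > > > > > >> using Flink's Table API purely for illustration (the exact API
> > > > > > >> surface may differ by Flink version). The 'hudi' connector and
> > > > > > >> its options below are assumed/hypothetical, not something that
> > > > > > >> exists today; 'datagen' is Flink's built-in testing source:
> > > > > > >>
> > > > > > >> import org.apache.flink.table.api.EnvironmentSettings;
> > > > > > >> import org.apache.flink.table.api.TableEnvironment;
> > > > > > >>
> > > > > > >> public class HudiFlinkTableSketch {
> > > > > > >>   public static void main(String[] args) {
> > > > > > >>     TableEnvironment tEnv = TableEnvironment.create(
> > > > > > >>         EnvironmentSettings.newInstance().inStreamingMode().build());
> > > > > > >>
> > > > > > >>     // an upstream dynamic table, backed by a testing source
> > > > > > >>     tEnv.executeSql(
> > > > > > >>         "CREATE TABLE trips (trip_id STRING, fare DOUBLE, ts TIMESTAMP(3)) "
> > > > > > >>             + "WITH ('connector' = 'datagen')");
> > > > > > >>
> > > > > > >>     // hypothetical Hudi-backed table sink
> > > > > > >>     tEnv.executeSql(
> > > > > > >>         "CREATE TABLE trips_hudi (trip_id STRING, fare DOUBLE, ts TIMESTAMP(3)) "
> > > > > > >>             + "WITH ('connector' = 'hudi', 'path' = 'file:///tmp/trips_hudi')");
> > > > > > >>
> > > > > > >>     // continuously write the dynamic table into the Hudi table
> > > > > > >>     tEnv.executeSql("INSERT INTO trips_hudi SELECT trip_id, fare, ts FROM trips");
> > > > > > >>   }
> > > > > > >> }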
> > > > > > >>
> > > > > > >> Attached a screen shot.
> > > > > > >>
> > > > > > >> This seems to fit with the original premise of Hudi as well.
> > > > > > >>
> > > > > > >> I am exploring this avenue with a use case that involves
> > > > > > >> "temporal joins on streams", which I need for feature
> > > > > > >> extraction.
> > > > > > >>
> > > > > > >> Anyone who is interested in this or has concrete enough needs
> > > > > > >> and use cases, please let me know.
> > > > > > >>
> > > > > > >> Best to go from an agreed-upon set of 2-3 use cases.
> > > > > > >>
> > > > > > >> Cheers
> > > > > > >>
> > > > > > >> Nick
> > > > > > >>
> > > > > > >>
> > > > > > >> > Also, we do have some Beam experts on the mailing list.. Can
> > > > > > >> > you please weigh in on the viability of using Beam as the
> > > > > > >> > intermediate abstraction here between Spark/Flink?
> > > > > > >> > Hudi uses RDD APIs like groupBy, mapToPair,
> > > > > > >> > sortAndRepartition, reduceByKey, countByKey and also does
> > > > > > >> > custom partitioning a lot.
> > > > > > >>
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
