Hi Vino,

From what I have seen, Hudi has a lot of Spark components flowing through it, like TaskContexts, JavaSparkContexts, etc. The main classes I think we should focus on are HoodieTable and the Hoodie write clients.

Also, Vino, I don't think we should provide a Flink DataSet implementation; we should stick to Flink Streaming only. Furthermore, if there is a requirement for batch, users should use Spark, and in any case we will have a Beam integration coming up. As for caching, how about we write our own stateful Flink function and use RocksDBStateBackend with some state TTL?
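Something along these lines is what I have in mind. This is only a rough sketch against the plain DataStream API (the class name, state name and TTL value are made up, and it is not wired into any Hudi classes yet):

// Rough sketch only -- names are invented; the TTL plays the role of cache eviction.
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class CachedRecordFunction extends KeyedProcessFunction<String, String, String> {

  private transient ValueState<String> cachedRecord;

  @Override
  public void open(Configuration parameters) {
    // Expire cached entries instead of keeping them around forever.
    StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.hours(1))
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .build();

    ValueStateDescriptor<String> descriptor =
        new ValueStateDescriptor<>("cachedRecord", String.class);
    descriptor.enableTimeToLive(ttlConfig);
    cachedRecord = getRuntimeContext().getState(descriptor);
  }

  @Override
  public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
    // Reuse the per-key cached value if present, otherwise "compute" and remember it.
    String cached = cachedRecord.value();
    if (cached == null) {
      cached = value; // placeholder for the real (expensive) computation
      cachedRecord.update(cached);
    }
    out.collect(cached);
  }
}

On the job side we would then just set env.setStateBackend(new RocksDBStateBackend(checkpointPath)) (via the flink-statebackend-rocksdb dependency) so this state lives in RocksDB rather than on the heap.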
On Tue, Aug 13, 2019, 2:28 PM vino yang <yanghua1...@gmail.com> wrote:

> Hi all,
>
> After doing some research, let me share my information:
>
> - Limitation of computing engine capabilities: Hudi uses Spark's RDD#persist, and Flink currently has no API to cache datasets. Maybe we can only choose to use external storage or not use a cache? For the other APIs, the two engines currently offer almost equivalent capabilities.
> - The abstraction of the computing engine is different: Considering the different usage scenarios of the computing engine in Hudi, and since Flink has not yet unified stream and batch processing, we may need to use both Flink's DataSet API (batch processing) and DataStream API (stream processing).
>
> Best,
> Vino
>
> nishith agarwal <n3.nas...@gmail.com> wrote on Thu, Aug 8, 2019 at 12:57 AM:
>
> > Nick,
> >
> > You bring up a good point about the non-trivial programming model differences between these technologies. From a theoretical perspective, I'd say considering a higher-level abstraction makes sense. I think we have to decouple some objectives and concerns here.
> >
> > a) The immediate desire is to have Hudi be able to run on a Flink (or non-Spark) engine. This naturally begs the question of decoupling Hudi concepts from direct Spark dependencies.
> >
> > b) If we do want to initiate the above effort, would it make sense to just have a higher-level abstraction, building on other technologies like Beam (Euphoria etc.), and provide single, clean APIs that may be more maintainable from a code perspective? At the same time, this will introduce challenges in maintaining efficiency and optimized runtime DAGs for Hudi (since the code would move away from point integrations, and whenever this happens, tuning natively for specific engines becomes more and more difficult).
> >
> > My general opinion is that, as the community grows over time with more folks having an in-depth understanding of Hudi, going from current_state -> (a) -> (b) might be the most reliable and adoptable path for this project.
> >
> > Thanks,
> > Nishith
> >
> > On Tue, Aug 6, 2019 at 1:30 PM Semantic Beeng <n...@semanticbeeng.com> wrote:
> >
> > > There are some non-trivial differences between the programming models and runtime semantics of Beam, Spark and Flink.
> > >
> > > https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-how
> > >
> > > Nitish, Vino - thoughts?
> > >
> > > Does it make sense to consider a higher-level abstraction / DSL instead of maintaining different code with the same functionality but different programming models?
> > >
> > > https://beam.apache.org/documentation/sdks/java/euphoria/
> > >
> > > Nick
> > >
> > > On August 6, 2019 at 4:04 PM nishith agarwal <n3.nas...@gmail.com> wrote:
> > >
> > > +1 for Approach 1, point integration with each framework.
> > >
> > > Pros for point integration
> > >
> > > - Hudi community is already familiar with Spark and Spark-based actions/shuffles etc.
> > > Since both modules can be decoupled, this enables us to have a steady release for Hudi for one execution engine (Spark) while we hone our skills and iterate on making the Flink DAG optimized and performant with the right configuration.
> > > - This might be a stepping stone towards rewriting the entire code base to be agnostic of Spark/Flink. This approach will help us fix tests and intricacies, and help make the code base ready for a larger rework.
> > > - Seems like the easiest way to add Flink support.
> > >
> > > Cons
> > > - More code paths to maintain and reason about, since the Spark and Flink integrations will naturally diverge over time.
> > >
> > > Theoretically, I do like the idea of being able to run the Hudi DAG on Beam more than point integrations, since there is one API/logic to reason about. But practically, that may not be the right direction.
> > >
> > > Pros
> > > - Lesser cognitive burden in maintaining, evolving and releasing the project, with one API to reason with.
> > > - Theoretically, going forward, assuming Beam is adopted as a standard programming paradigm for stream/batch, this would enable consumers to leverage the power of Hudi more easily.
> > >
> > > Cons
> > > - Massive rewrite of the code base. Additionally, since we would have moved away from directly using Spark APIs, there is a bigger risk of regression. We would have to be very thorough with all the intricacies and ensure the same stability of new releases.
> > > - Managing future features (which may be very Spark-driven) will either clash, pause, or need to be reworked.
> > > - Tuning jobs for Spark/Flink-type execution frameworks individually might be difficult and will get more difficult over time as the project evolves, where some Beam integrations with Spark/Flink may not work as expected.
> > > - Also, as pointed out above, we would probably still need to support the hoodie-spark module as first-class.
> > >
> > > Thanks,
> > > Nishith
> > >
> > > On Tue, Aug 6, 2019 at 9:48 AM taher koitawala <taher...@gmail.com> wrote:
> > >
> > > Hi Vinoth,
> > > Are there some tasks I can take up to ramp up on the code? I want to get more used to the code and understand the existing implementation better.
> > >
> > > Thanks,
> > > Taher Koitawala
> > >
> > > On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar <vin...@apache.org> wrote:
> > >
> > > Let's see if others have any thoughts as well. We can plan to finalize the approach by EOW.
> > >
> > > On Mon, Aug 5, 2019 at 7:06 PM vino yang <yanghua1...@gmail.com> wrote:
> > >
> > > Hi guys,
> > >
> > > Also, +1 for Approach 1, like Taher.
> > >
> > > "If we can do a comprehensive analysis of this model and come up with means to refactor this cleanly, this would be promising."
> > >
> > > Yes, when we reach a conclusion, we can start this work.
> > >
> > > Best,
> > > Vino
> > >
> > > taher koitawala <taher...@gmail.com> wrote on Tue, Aug 6, 2019 at 12:28 AM:
> > >
> > > +1 for Approach 1, point integration with each framework.
> > >
> > > Approach 2 has a problem, as you said: "Developers need to think about what-if-this-piece-of-code-ran-as-spark-vs-flink..
> > > So in the end, this may not be the panacea that it seems to be."
> > >
> > > We have seen various pipelines in the Beam DAG being expressed differently than we had them in our original use case. Also, switching between the Spark and Flink runners in Beam has various impacts on the pipelines; for example, some features available in Flink are not available on the Spark runner, etc.
> > >
> > > Refer to this capability matrix -> https://beam.apache.org/documentation/runners/capability-matrix/
> > >
> > > Hence my vote is for Approach 1: let's decouple and build the abstraction for each framework. That is a much better option. We will also have more control over each framework's implementation.
> > >
> > > On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar <vin...@apache.org> wrote:
> > >
> > > I would like to highlight that there are two distinct approaches here with different tradeoffs. Think of this as my braindump, as I have been thinking about this quite a bit in the past.
> > >
> > > *Approach 1: Point integration with each framework*
> > >
> > > We may need a pure client module named, for example, hoodie-client-core (common). Then we could have: hoodie-client-spark, hoodie-client-flink and hoodie-client-beam.
> > >
> > > (+) This is the safest to do IMO, since we can isolate the current Spark execution (hoodie-spark, hoodie-client-spark) from the changes for Flink, while it stabilizes over a few releases.
> > > (-) Downside is that the utilities need to be redone: hoodie-utilities-spark and hoodie-utilities-flink and hoodie-utilities-core? hoodie-cli?
> > >
> > > If we can do a comprehensive analysis of this model and come up with means to refactor this cleanly, this would be promising.
> > >
> > > *Approach 2: Beam as the compute abstraction*
> > >
> > > Another, more drastic, approach is to remove Spark as the compute abstraction for writing data and replace it with Beam.
> > >
> > > (+) All of the code remains more or less similar and there is one compute API to reason about.
> > > (-) The (very big) assumption here is that we are able to tune the Spark runtime the same way using Beam: custom partitioners, support for all RDD operations we invoke, caching, etc.
> > > (-) It will be a massive rewrite, and testing such a large rewrite would also be really challenging, since we need to pay attention to all the intricate details to ensure that today's Spark users experience no regressions/side-effects.
> > > (-) Note that we would probably still need to support the hoodie-spark module, and maybe a similar first-class integration with Flink, for native Flink/Spark pipeline authoring. Users of, say, DeltaStreamer need to pass in Spark or Flink configs anyway.. Developers need to think about what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end, this may not be the panacea that it seems to be.
> > >
> > > One goal for the HIP is to get us all to agree as a community which one to pick, with sufficient investigation, testing and benchmarking..
> > >
> > > On Sat, Aug 3, 2019 at 7:56 PM vino yang <yanghua1...@gmail.com> wrote:
> > >
> > > +1 for both Beam and Flink.
> > >
> > > "First step here is to probably draw out the current hierarchy and figure out what the abstraction points are.. In my opinion, the runtime (spark, flink) should be done at the hoodie-client level and just used by hoodie-utilities seamlessly.."
> > >
> > > +1 for Vinoth's opinion, it should be the first step.
> > >
> > > No matter which computing framework we hope to integrate Hudi with, we need to decouple the Hudi client from Spark.
> > >
> > > We may need a pure client module named, for example, hoodie-client-core (common). Then we could have: hoodie-client-spark, hoodie-client-flink and hoodie-client-beam.
> > >
> > > Suneel Marthi <smar...@apache.org> wrote on Sun, Aug 4, 2019 at 10:45 AM:
> > >
> > > +1 for Beam -- agree with Semantic Beeng's analysis.
> > >
> > > On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <taher...@gmail.com> wrote:
> > >
> > > So the way to go about this is to file a HIP, chalk all the classes out, and start moving towards a pure client.
> > >
> > > Secondly, do we want to try Beam?
> > >
> > > I think there is too much going on here and I'm not able to follow. If we want to try out Beam all along, I don't think it makes sense to do anything on Flink then.
> > >
> > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <n...@semanticbeeng.com> wrote:
> > >
> > > >> +1 My money is on this approach.
> > > >>
> > > >> The existing abstractions from Beam seem enough for the use cases as I imagine them.
> > > >>
> > > >> Flink also has "dynamic table", "table source" and "table sink", which seem like very useful abstractions where Hudi might fit nicely.
> > > >>
> > > >> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> > > >>
> > > >> Attached a screenshot.
> > > >>
> > > >> This seems to fit with the original premise of Hudi as well.
> > > >>
> > > >> I am exploring this avenue with a use case that involves "temporal joins on streams", which I need for feature extraction.
> > > >>
> > > >> Anyone who is interested in this or has concrete enough needs and use cases, please let me know.
> > > >>
> > > >> Best to go from an agreed-upon set of 2-3 use cases.
> > > >>
> > > >> Cheers
> > > >>
> > > >> Nick
> > > >>
> > > >> > Also, we do have some Beam experts on the mailing list.. Can you please weigh in on the viability of using Beam as the intermediate abstraction here between Spark/Flink?
> > > >> > Hudi uses RDD APIs like groupBy, mapToPair, sortAndRepartition, reduceByKey, countByKey and also does custom partitioning a lot.
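P.S. To make the hoodie-client-core / hoodie-client-spark / hoodie-client-flink idea a little more concrete, below is the kind of engine-neutral seam I am imagining. All of the interface and method names here are invented for illustration only; none of this is actual Hudi code.

import java.io.Serializable;
import java.util.List;

// Hypothetical sketch only -- these types do not exist in Hudi today; they just
// illustrate how hoodie-client-core could stay free of Spark/Flink imports.

/** Engine-agnostic handle on a distributed collection of records. */
interface HoodieData<T> {
  <O> HoodieData<O> map(SerializableFunction<T, O> fn);
  HoodieData<T> repartition(int parallelism);
  List<T> collectAsList();
}

/** Serializable function so both Spark and Flink can ship it to tasks. */
interface SerializableFunction<I, O> extends Serializable {
  O apply(I input) throws Exception;
}

/** One implementation per engine: Spark wraps JavaRDDs, Flink wraps DataStreams. */
interface HoodieEngineContext {
  <T> HoodieData<T> parallelize(List<T> records, int parallelism);
  // Caching hook: Spark can call persist(), Flink can fall back to keyed state or a no-op.
  <T> HoodieData<T> cacheIfSupported(HoodieData<T> data);
}

/** Placeholder for the existing per-record write result. */
class WriteStatus implements Serializable {}

/** Write clients in hoodie-client-core would program only against the types above. */
abstract class AbstractHoodieWriteClient<T> {
  protected final HoodieEngineContext context;

  protected AbstractHoodieWriteClient(HoodieEngineContext context) {
    this.context = context;
  }

  public abstract HoodieData<WriteStatus> upsert(HoodieData<T> records, String instantTime);
}

hoodie-client-spark would then implement HoodieEngineContext over a JavaSparkContext/JavaRDD and hoodie-client-flink over a DataStream, so HoodieTable and the write clients never touch engine classes directly.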