Re: [DISCUSS] Decouple Hudi and Spark

vino yang Tue, 13 Aug 2019 03:57:36 -0700

Hi Nick and Taher,

I just want to answer Nishith's question. Reference his old description
here:


> You can do a parallel investigation while we are deciding on the module
structure.  You could be looking at all the patterns in Hudi's Spark APIs
usage (RDD/DataSource/SparkContext) and see if such support can be achieved
in theory with Flink. If not, what is the workaround. Documenting such
patterns would be valuable when multiple engineers are working on it. For
e:g, Hudi relies on     (a) custom partitioning logic for upserts,     (b)
caching RDDs to avoid reruns of costly stages     (c) A Spark upsert task
knowing its spark partition/task/attempt ids

And just like the title of this thread, we are going to try to decouple
Hudi and Spark. That means we can run the whole Hudi without depending
Spark. So we need to analyze all the usage of Spark in Hudi.

Here we are not discussing the integration of Hudi and Flink in the
application layer. Instead, I want Hudi to be decoupled from Spark and
allow other engines (such as Flink) to replace Spark.

It can be divided into long-term goals and short-term goals. As Nishith
stated in a recent email.

I mentioned the Flink Batch API here because Hudi can connect with many
different Source/Sinks. Some file-based reads are not appropriate for Flink
Streaming.

Therefore, this is a comprehensive survey of the use of Spark in Hudi.

Best,
Vino


taher koitawala <[email protected]> 于2019年8月13日周二 下午5:43写道：

> Hi Vino,
>       According to what I've seen Hudi has a lot of spark component flowing
> throwing it. Like Taskcontexts, JavaSparkContexts etc. The main classes I
> guess we should focus upon is HoodieTable and Hoodie write clients.
>
> Also Vino, I don't think we should be providing Flink dataset
> implementation. We should only stick to Flink Streaming.
>                Furthermore if there is a requirement for batch then users
> should use Spark or then we will anyway have a beam integration coming up.
>
> As of cache, How about we write our stateful Flink function and use
> RocksDbStateBackend with some state TTL.
>
> On Tue, Aug 13, 2019, 2:28 PM vino yang <[email protected]> wrote:
>
> > Hi all,
> >
> > After doing some research, let me share my information:
> >
> >
> >    - Limitation of computing engine capabilities: Hudi uses Spark's
> >    RDD#persist, and Flink currently has no API to cache datasets. Maybe
> we
> > can
> >    only choose to use external storage or do not use cache? For the use
> of
> >    other APIs, the two currently offer almost equivalent capabilities.
> >    - The abstraction of the computing engine is different: Considering
> the
> >    different usage scenarios of the computing engine in Hudi, Flink has
> not
> >    yet implemented stream batch unification, so we may use both Flink's
> >    DataSet API (batch processing) and DataStream API (stream processing).
> >
> > Best,
> > Vino
> >
> > nishith agarwal <[email protected]> 于2019年8月8日周四 上午12:57写道：
> >
> > > Nick,
> > >
> > > You bring up a good point about the non-trivial programming model
> > > differences between these different technologies. From a theoretical
> > > perspective, I'd say considering a higher level abstraction makes
> sense.
> > I
> > > think we have to decouple some objectives and concerns here.
> > >
> > > a) The immediate desire is to have Hudi be able to run on a Flink (or
> > > non-spark) engine. This naturally begs the question of decoupling Hudi
> > > concepts from direct Spark dependencies.
> > >
> > > b) If we do want to initiate the above effort, would it make sense to
> > just
> > > have a higher level abstraction, building on other technologies like
> beam
> > > (euphoria etc) and provide single, clean API's that may be more
> > > maintainable from a code perspective. But at the same time this will
> > > introduce challenges on how to maintain efficiency and optimized
> runtime
> > > dags for Hudi (since the code would move away from point integrations
> and
> > > whenever this happens, tuning natively for specific engines becomes
> more
> > > and more difficult).
> > >
> > > My general opinion is that, as the community grows over time with more
> > > folks having an in-depth understanding of Hudi, going from
> current_state
> > ->
> > > (a) -> (b) might be the most reliable and adoptable path for this
> > project.
> > >
> > > Thanks,
> > > Nishith
> > >
> > > On Tue, Aug 6, 2019 at 1:30 PM Semantic Beeng <[email protected]>
> > > wrote:
> > >
> > > > There are some not trivial difference between programming model and
> > > > runtime semantics between Beam, Spark and Flink.
> > > >
> > > >
> > > >
> > >
> >
> https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-how
> > > >
> > > > Nitish, Vino - thoughts?
> > > >
> > > > Does it feel to consider a higher level abstraction / DSL instead of
> > > > maintaining different code with same functionality but different
> > > > programming models ?
> > > >
> > > > https://beam.apache.org/documentation/sdks/java/euphoria/
> > > >
> > > > Nick
> > > >
> > > >
> > > >
> > > >
> > > > On August 6, 2019 at 4:04 PM nishith agarwal <[email protected]>
> > > wrote:
> > > >
> > > >
> > > > +1 for Approach 1 Point integration with each framework.
> > > >
> > > > Pros for point integration
> > > >
> > > >    - Hudi community is already familiar with spark and spark based
> > > >
> > > >
> > > > actions/shuffles etc. Since both modules can be decoupled, this
> enables
> > > us
> > > > to have a steady release for Hudi for 1 execution engine (spark)
> while
> > we
> > > > hone our skills and iterate on making flink dag optimized, performant
> > > with
> > > > the right configuration.
> > > >
> > > >    - This might be a stepping stone towards rewriting the entire code
> > > base
> > > >
> > > >
> > > > being agnostic of spark/flink. This approach will help us fix tests,
> > > > intricacies and help make the code base ready for a larger rework.
> > > >
> > > >    - Seems like the easiest way to add flink support
> > > >
> > > >
> > > >
> > > > Cons
> > > >
> > > >    - More code paths to maintain and reason since the spark and flink
> > > >
> > > >
> > > > integrations will naturally diverge over time.
> > > >
> > > > Theoretically, I do like the idea of being able to run the hudi dag
> on
> > > beam
> > > > more than point integrations, where there is one API/logic to reason
> > > about.
> > > > But practically, that may not be the right direction.
> > > >
> > > > Pros
> > > >
> > > >    - Lesser cognitive burden in maintaining, evolving and releasing
> the
> > > >
> > > >
> > > > project with one API to reason with.
> > > >
> > > >    - Theoretically, going forward assuming beam is adopted as a
> > standard
> > > >
> > > >
> > > > programming paradigm for stream/batch, this would enable consumers
> > > leverage
> > > > the power of hudi more easily.
> > > >
> > > > Cons
> > > >
> > > >    - Massive rewrite of the code base. Additionally, since we would
> > have
> > > >    moved
> > > >
> > > >
> > > > away from directly using spark APIs, there is a bigger risk of
> > > regression.
> > > > We would have to be very thorough with all the intricacies and ensure
> > the
> > > > same stability of new releases.
> > > >
> > > >    - Managing future features (which may be very spark driven) will
> > > either
> > > >
> > > >
> > > > clash or pause or will need to be reworked.
> > > >
> > > >    - Tuning jobs for Spark/Flink type execution frameworks
> individually
> > > >    might
> > > >
> > > >
> > > > be difficult and will get difficult over time as the project evolves,
> > > where
> > > > some beam integrations with spark/flink may not work as expected.
> > > >
> > > >    - Also, as pointed above, need to probably support the
> hoodie-spark
> > > >    module
> > > >
> > > >
> > > > as a first-class.
> > > >
> > > > Thank,
> > > > Nishith
> > > >
> > > >
> > > > On Tue, Aug 6, 2019 at 9:48 AM taher koitawala <[email protected]>
> > > wrote:
> > > >
> > > > Hi Vinoth,
> > > > Are there some tasks I can take up to ramp up the code? Want to get
> > > > more used to the code and understand the existing implementation
> > better.
> > > >
> > > > Thanks,
> > > > Taher Koitawala
> > > >
> > > > On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar <[email protected]>
> > wrote:
> > > >
> > > > Let's see if others have any thoughts as well. We can plan to fix the
> > > > approach by EOW.
> > > >
> > > > On Mon, Aug 5, 2019 at 7:06 PM vino yang <[email protected]>
> > wrote:
> > > >
> > > > Hi guys,
> > > >
> > > > Also, +1 for Approach 1 like Taher.
> > > >
> > > > If we can do a comprehensive analysis of this model and come up with.
> > > >
> > > > means
> > > >
> > > > to refactor this cleanly, this would be promising.
> > > >
> > > > Yes, when we get the conclusion, we could start this work.
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > > >
> > > >
> > > > taher koitawala <[email protected]> 于2019年8月6日周二 上午12:28写道：
> > > >
> > > > +1 for Approch 1 Point integration with each framework
> > > >
> > > > Approach 2 has a problem as you said "Developers need to think about
> > > > what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end,
> > > >
> > > > this
> > > >
> > > > may
> > > >
> > > > not be the panacea that it seems to be"
> > > >
> > > > We have seen various pipelines in the beam dag being expressed
> > > >
> > > > differently
> > > >
> > > > then we had them in our original usecase. And also switching between
> > > >
> > > > spark
> > > >
> > > > and Flink runners in beam have various impact on the pipelines like
> > > >
> > > > some
> > > >
> > > > features available in Flink are not available on the spark runner
> > > >
> > > > etc.
> > > >
> > > > Refer to this compatible matrix ->
> > > > https://beam.apache.org/documentation/runners/capability-matrix/
> > > >
> > > > Hence my vote on Approch 1 let's decouple and build the abstract for
> > > >
> > > > each
> > > >
> > > > framework. That is a much better option. We will also have more
> > > >
> > > > control
> > > >
> > > > over each framework's implement.
> > > >
> > > > On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar <[email protected]>
> > > >
> > > > wrote:
> > > >
> > > > Would like to highlight that there are two distinct approaches here
> > > >
> > > > with
> > > >
> > > > different tradeoffs. Think of this as my braindump, as I have been
> > > >
> > > > thinking
> > > >
> > > > about this quite a bit in the past.
> > > >
> > > > >
> > > >
> > > > *Approach 1 : Point integration with each framework *
> > > >
> > > > We may need a pure client module named for example
> > > > hoodie-client-core(common)
> > > > >> Then we could have: hoodie-client-spark, hoodie-client-flink
> > > > and hoodie-client-beam
> > > >
> > > > (+) This is the safest to do IMO, since we can isolate the current
> > > >
> > > > Spark
> > > >
> > > > execution (hoodie-spark, hoodie-client-spark) from the changes for
> > > >
> > > > flink,
> > > >
> > > > while it stabilizes over few releases.
> > > > (-) Downside is that the utilities needs to be redone :
> > > > hoodie-utilities-spark and hoodie-utilities-flink and
> > > > hoodie-utilities-core ? hoodie-cli?
> > > >
> > > > If we can do a comprehensive analysis of this model and come up
> > > >
> > > > with.
> > > >
> > > > means
> > > >
> > > > to refactor this cleanly, this would be promising.
> > > >
> > > > >
> > > >
> > > > *Approach 2: Beam as the compute abstraction*
> > > >
> > > > Another more drastic approach is to remove Spark as the compute
> > > >
> > > > abstraction
> > > >
> > > > for writing data and replace it with Beam.
> > > >
> > > > (+) All of the code remains more or less similar and there is one
> > > >
> > > > compute
> > > >
> > > > API to reason about.
> > > >
> > > > (-) The (very big) assumption here is that we are able to tune the
> > > >
> > > > spark
> > > >
> > > > runtime the same way using Beam : custom partitioners, support for
> > > >
> > > > all
> > > >
> > > > RDD
> > > >
> > > > operations we invoke, caching etc etc.
> > > > (-) It will be a massive rewrite and testing of such a large
> > > >
> > > > rewrite
> > > >
> > > > would
> > > >
> > > > also be really challenging, since we need to pay attention to all
> > > >
> > > > intricate
> > > >
> > > > details to ensure the spark users today experience no
> > > > regressions/side-effects
> > > > (-) Note that we still need to probably support the hoodie-spark
> > > >
> > > > module
> > > >
> > > > and
> > > >
> > > > may be a first-class such integration with flink, for native
> > > >
> > > > flink/spark
> > > >
> > > > pipeline authoring. Users of say DeltaStreamer need to pass in
> > > >
> > > > Spark
> > > >
> > > > or
> > > >
> > > > Flink configs anyway.. Developers need to think about
> > > > what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end,
> > > >
> > > > this
> > > >
> > > > may
> > > >
> > > > not be the panacea that it seems to be.
> > > >
> > > > >
> > > > >
> > > >
> > > > One goal for the HIP is to get us all to agree as a community which
> > > >
> > > > one
> > > >
> > > > to
> > > >
> > > > pick, with sufficient investigation, testing, benchmarking..
> > > >
> > > > On Sat, Aug 3, 2019 at 7:56 PM vino yang <[email protected]>
> > > >
> > > > wrote:
> > > >
> > > > +1 for both Beam and Flink
> > > >
> > > > First step here is to probably draw out current hierrarchy and
> > > >
> > > > figure
> > > >
> > > > out
> > > >
> > > > what the abstraction points are..
> > > > In my opinion, the runtime (spark, flink) should be done at the
> > > > hoodie-client level and just used by hoodie-utilties
> > > >
> > > > seamlessly..
> > > >
> > > > +1 for Vinoth's opinion, it should be the first step.
> > > >
> > > > No matter we hope Hudi to integrate with which computing
> > > >
> > > > framework.
> > > >
> > > > We need to decouple Hudi client and Spark.
> > > >
> > > > We may need a pure client module named for example
> > > > hoodie-client-core(common)
> > > >
> > > > Then we could have: hoodie-client-spark, hoodie-client-flink and
> > > > hoodie-client-beam
> > > >
> > > > Suneel Marthi <[email protected]> 于2019年8月4日周日 上午10:45写道：
> > > >
> > > > +1 for Beam -- agree with Semantic Beeng's analysis.
> > > >
> > > > On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <
> > > >
> > > > [email protected]>
> > > >
> > > > wrote:
> > > >
> > > > So the way to go around this is that file a hip. Chalk all th
> > > >
> > > > classes
> > > >
> > > > our
> > > >
> > > > and start moving towards Pure client.
> > > >
> > > > Secondly should we want to try beam?
> > > >
> > > > I think there is to much going on here and I'm not able to
> > > >
> > > > follow.
> > > >
> > > > If
> > > >
> > > > we
> > > >
> > > > want to try out beam all along I don't think it makes sense
> > > >
> > > > to
> > > >
> > > > do
> > > >
> > > > anything
> > > >
> > > > on Flink then.
> > > >
> > > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <
> > > >
> > > > [email protected]>
> > > >
> > > > wrote:
> > > >
> > > > >> +1 My money is on this approach.
> > > > >>
> > > > >> The existing abstractions from Beam seem enough for the use
> > > >
> > > > cases
> > > >
> > > > as I
> > > >
> > > > imagine them.
> > > >
> > > > >> Flink also has "dynamic table", "table source" and "table
> > > >
> > > > sink"
> > > >
> > > > which
> > > >
> > > > seem very useful abstractions where Hudi might fit nicely.
> > > >
> > > > >>
> > > > >>
> > > >
> > > >
> > > >
> > >
> >
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> > > >
> > > > >>
> > > > >> Attached a screen shot.
> > > > >>
> > > > >> This seems to fit with the original premise of Hudi as well.
> > > > >>
> > > > >> Am exploring this venue with a use case that involves
> > > >
> > > > "temporal
> > > >
> > > > joins
> > > >
> > > > on
> > > >
> > > > streams" which I need for feature extraction.
> > > >
> > > > >> Anyone is interested in this or has concrete enough needs
> > > >
> > > > and
> > > >
> > > > use
> > > >
> > > > cases
> > > >
> > > > please let me know.
> > > >
> > > > >> Best to go from an agreed upon set of 2-3 use cases.
> > > > >>
> > > > >> Cheers
> > > > >>
> > > > >> Nick
> > > > >>
> > > > >>
> > > > >> > Also, we do have some Beam experts on the mailing list..
> > > >
> > > > Can
> > > >
> > > > you
> > > >
> > > > please
> > > > >> weigh on viability of using Beam as the intermediate
> > > >
> > > > abstraction
> > > >
> > > > here
> > > >
> > > > between Spark/Flink?
> > > > Hudi uses RDD apis like groupBy, mapToPair,
> > > >
> > > > sortAndRepartition,
> > > >
> > > > reduceByKey, countByKey and also does custom partitioning a
> > > >
> > > > lot.>
> > > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Decouple Hudi and Spark

Reply via email to