+1 on approach 1. As pointed out, approach 2 carries a risk of performance 
regression when introducing the Beam abstraction. To keep things simpler and 
start iterating, we can try an incremental route where Beam can be thought of 
as another engine supporting Hudi. When there is material confidence that 
there won't be any performance regression with Beam, we can start the 
unification effort. This also sounds like a more practical approach, as the 
folks who have volunteered to spearhead this project have Flink experience.
Vino/Taher/Vinay,
Being new to Flink, I would start by identifying whether there are any 
high-level differences in guarantees and programming model between Spark and 
Flink. You can do a parallel investigation while we are deciding on the module 
structure. You could look at all the patterns in Hudi's Spark API usage 
(RDD/DataSource/SparkContext) and see whether such support can be achieved in 
theory with Flink, and if not, what the workaround would be. Documenting such 
patterns would be valuable when multiple engineers are working on it. For 
example, Hudi relies on:
(a) custom partitioning logic for upserts
(b) caching RDDs to avoid reruns of costly stages
(c) a Spark upsert task knowing its Spark partition/task/attempt ids
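To make pattern (a) concrete for whoever picks this up: upsert partitioning boils down to routing each record to the file group that already owns its key, and hashing new keys into insert buckets. A minimal engine-agnostic sketch of that routing logic (all names here are hypothetical, not actual Hudi classes):

```python
# Hypothetical sketch of pattern (a): custom partitioning for upserts.
# Hudi needs to route an update to the file group that already holds the
# record key; any engine (Spark, Flink, Beam) must let us control this routing.

def upsert_partitioner(record_key: str, existing_index: dict, num_new_buckets: int) -> str:
    """Route a record to its existing file group, or hash new keys into insert buckets."""
    if record_key in existing_index:
        # update: must land on the file group that owns this key
        return existing_index[record_key]
    # insert: deterministic (per-run) hash into one of the new-file buckets
    return f"new-bucket-{hash(record_key) % num_new_buckets}"

# toy index mapping record keys to their owning file groups
index = {"key-1": "file-group-A", "key-2": "file-group-B"}
routed = {k: upsert_partitioner(k, index, 4) for k in ["key-1", "key-2", "key-3"]}
```

The question for the investigation is whether Flink gives us an equivalent hook (e.g. its key-partitioning of streams) to enforce this routing.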

Balaji.V
    On Monday, August 5, 2019, 08:59:04 AM PDT, Vinoth Chandar 
<[email protected]> wrote:  
 
 Would like to highlight that there are two distinct approaches here with
different tradeoffs. Think of this as my braindump, as I have been thinking
about this quite a bit in the past.


*Approach 1 : Point integration with each framework *

>>We may need a pure client module named for example
hoodie-client-core(common)
>> Then we could have: hoodie-client-spark, hoodie-client-flink
and hoodie-client-beam

(+) This is the safest to do IMO, since we can isolate the current Spark
execution (hoodie-spark, hoodie-client-spark) from the changes for Flink
while it stabilizes over a few releases.
(-) Downside is that the utilities need to be redone:
 hoodie-utilities-spark and hoodie-utilities-flink and
hoodie-utilities-core? hoodie-cli?

If we can do a comprehensive analysis of this model and come up with means
to refactor this cleanly, it would be promising.
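The core of that refactor is an engine-agnostic seam: hoodie-client-core would define the operations the write path needs, and hoodie-client-spark / hoodie-client-flink would each implement them. A rough sketch of the shape, using hypothetical names (none of these are real Hudi APIs):

```python
# Hypothetical sketch of the proposed module split. An abstract engine
# context lives in hoodie-client-core; each engine module supplies a
# concrete implementation. Names are illustrative only.
from abc import ABC, abstractmethod
from typing import Callable, Iterable, List


class EngineContext(ABC):                 # would live in hoodie-client-core
    @abstractmethod
    def map(self, data: Iterable, fn: Callable) -> List: ...

    @abstractmethod
    def cache(self, data: Iterable) -> Iterable: ...


class LocalEngineContext(EngineContext):  # stand-in for a Spark/Flink impl
    def map(self, data, fn):
        return [fn(x) for x in data]

    def cache(self, data):
        return list(data)                 # materialize, like RDD.cache()


def write_client_upsert(ctx: EngineContext, records):
    """Engine-agnostic write path: core logic calls only the abstraction."""
    staged = ctx.cache(records)           # avoid recomputing costly stages
    return ctx.map(staged, lambda r: {**r, "committed": True})
```

The analysis would then be: can every Spark call in the current client be pushed behind such an interface without losing the tuning knobs (partitioners, caching levels) we rely on?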


*Approach 2: Beam as the compute abstraction*

Another more drastic approach is to remove Spark as the compute abstraction
for writing data and replace it with Beam.

(+) All of the code remains more or less similar and there is one compute
API to reason about.

(-) The (very big) assumption here is that we are able to tune the Spark
runtime the same way through Beam: custom partitioners, support for all the
RDD operations we invoke, caching, etc.
(-) It would be a massive rewrite, and testing such a large rewrite would
also be really challenging, since we need to pay attention to all the
intricate details to ensure that today's Spark users experience no
regressions/side-effects.
(-) Note that we would still probably need to support the hoodie-spark module,
and maybe a similar first-class integration with Flink, for native Flink/Spark
pipeline authoring. Users of, say, DeltaStreamer need to pass in Spark or
Flink configs anyway, and developers need to think about
what-if-this-piece-of-code-ran-as-spark-vs-flink. So in the end, this may
not be the panacea that it seems to be.
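One way to de-risk that very big assumption is to check, op by op, that everything Hudi calls on RDDs is expressible over Beam's narrower primitives (roughly GroupByKey plus per-key combining). A toy sketch in plain Python, with no Beam dependency and purely for illustration, showing two of the ops we use rebuilt on a generic group-then-combine:

```python
# Illustrative sketch only: reduceByKey and countByKey (two RDD ops Hudi
# uses) rebuilt on a generic group-then-combine primitive, which is roughly
# the shape Beam exposes. This checks expressibility in principle; the open
# question is performance, not expressibility.
from collections import defaultdict
from functools import reduce


def group_by_key(pairs):
    """GroupByKey-like: (k, v) pairs -> {k: [v, ...]}."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return dict(grouped)


def reduce_by_key(pairs, fn):
    """RDD.reduceByKey expressed as group-then-combine."""
    return {k: reduce(fn, vs) for k, vs in group_by_key(pairs).items()}


def count_by_key(pairs):
    """RDD.countByKey expressed as group-then-combine."""
    return {k: len(vs) for k, vs in group_by_key(pairs).items()}


pairs = [("a", 1), ("b", 2), ("a", 3)]
```

Custom partitioning and caching are the harder cases, since they are runtime hints rather than transforms, and would need a per-runner answer.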



One goal for the HIP is to get us all to agree as a community on which one to
pick, with sufficient investigation, testing, and benchmarking.

On Sat, Aug 3, 2019 at 7:56 PM vino yang <[email protected]> wrote:

> +1 for both Beam and Flink
>
> > First step here is probably to draw out the current hierarchy and figure
> > out what the abstraction points are.
> > In my opinion, the runtime (Spark, Flink) should be done at the
> > hoodie-client level and just used by hoodie-utilities seamlessly.
>
> +1 for Vinoth's opinion, it should be the first step.
>
> No matter which computing framework we hope Hudi to integrate with,
> we need to decouple the Hudi client from Spark.
>
> We may need a pure client module named for example
> hoodie-client-core(common)
>
> Then we could have: hoodie-client-spark, hoodie-client-flink and
> hoodie-client-beam
>
> Suneel Marthi <[email protected]> 于2019年8月4日周日 上午10:45写道:
>
> > +1 for Beam -- agree with Semantic Beeng's analysis.
> >
> > On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <[email protected]>
> > wrote:
> >
> > > So the way to go about this is to file a HIP, chalk out all the classes,
> > > and start moving towards a pure client.
> > >
> > > Secondly, should we want to try Beam?
> > >
> > > I think there is too much going on here and I'm not able to follow. If
> > > we want to try out Beam all along, I don't think it makes sense to do
> > > anything on Flink.
> > >
> > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <[email protected]>
> > > wrote:
> > >
> > >> +1 My money is on this approach.
> > >>
> > >> The existing abstractions from Beam seem enough for the use cases as I
> > >> imagine them.
> > >>
> > >> Flink also has "dynamic table", "table source" and "table sink" which
> > >> seem very useful abstractions where Hudi might fit nicely.
> > >>
> > >>
> > >>
> >
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> > >>
> > >>
> > >> Attached a screen shot.
> > >>
> > >> This seems to fit with the original premise of Hudi as well.
> > >>
> > >> Am exploring this avenue with a use case that involves "temporal joins
> > >> on streams", which I need for feature extraction.
> > >>
> > >> Anyone who is interested in this or has concrete enough needs and use
> > >> cases, please let me know.
> > >>
> > >> Best to go from an agreed upon set of 2-3 use cases.
> > >>
> > >> Cheers
> > >>
> > >> Nick
> > >>
> > >>
> > >> > Also, we do have some Beam experts on the mailing list. Can you
> > >> > please weigh in on the viability of using Beam as the intermediate
> > >> > abstraction here between Spark/Flink?
> > >> > Hudi uses RDD APIs like groupBy, mapToPair, sortAndRepartition,
> > >> > reduceByKey, countByKey and also does custom partitioning a lot.
> > >>
> > >> >
> > >>
> > >
> >
>  
