Would like to highlight that there are two distinct approaches here, with different tradeoffs. Think of this as my braindump; I have been thinking about this quite a bit.
*Approach 1: Point integration with each framework*

>> We may need a pure client module named, for example, hoodie-client-core (common)
>> Then we could have: hoodie-client-spark, hoodie-client-flink and hoodie-client-beam

(+) This is the safest to do IMO, since we can isolate the current Spark execution (hoodie-spark, hoodie-client-spark) from the changes for Flink, while it stabilizes over a few releases.
(-) Downside is that the utilities need to be redone as well: hoodie-utilities-spark, hoodie-utilities-flink and hoodie-utilities-core? hoodie-cli?

If we can do a comprehensive analysis of this model and come up with a means to refactor it cleanly, this would be promising.

*Approach 2: Beam as the compute abstraction*

Another, more drastic approach is to remove Spark as the compute abstraction for writing data and replace it with Beam.

(+) All of the code remains more or less the same, and there is one compute API to reason about.
(-) The (very big) assumption here is that we are able to tune the Spark runtime the same way through Beam: custom partitioners, support for all the RDD operations we invoke, caching, etc.
(-) It will be a massive rewrite, and testing such a large rewrite would also be really challenging, since we need to pay attention to all the intricate details to ensure that today's Spark users experience no regressions/side-effects.
(-) Note that we probably still need to support the hoodie-spark module, and maybe a similar first-class integration with Flink, for native Flink/Spark pipeline authoring. Users of, say, DeltaStreamer need to pass in Spark or Flink configs anyway, and developers need to reason about what-if-this-piece-of-code-ran-as-Spark-vs-Flink. So in the end, this may not be the panacea it seems to be.

One goal for the HIP is to get us all, as a community, to agree on which one to pick, with sufficient investigation, testing and benchmarking.
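To make Approach 1 a bit more concrete, here is a rough, untested Java sketch of what the seam between hoodie-client-core and the engine modules could look like. All names below (HoodieEngineContext, SparkEngineContext) are made up for illustration; they are not existing Hudi classes:

    // All names here are hypothetical, for illustration only.
    import java.util.List;
    import java.util.function.Function;
    import org.apache.spark.api.java.JavaSparkContext;

    // hoodie-client-core: engine-agnostic contract the write client would code against.
    interface HoodieEngineContext {
      // Engine-neutral stand-in for the parallel map the client does via RDDs today.
      <I, O> List<O> map(List<I> input, Function<I, O> func, int parallelism);
    }

    // hoodie-client-spark: the only module that depends on Spark. hoodie-client-flink
    // and hoodie-client-beam would ship their own implementations of the same interface.
    class SparkEngineContext implements HoodieEngineContext {
      private final JavaSparkContext jsc;

      SparkEngineContext(JavaSparkContext jsc) {
        this.jsc = jsc;
      }

      @Override
      public <I, O> List<O> map(List<I> input, Function<I, O> func, int parallelism) {
        // In practice func would need to be Serializable, since Spark ships it to executors.
        return jsc.parallelize(input, parallelism).map(func::apply).collect();
      }
    }

The point being: hoodie-client-core would build with zero Spark dependencies, hoodie-utilities could be split along the same lines, and whether this is actually the right seam is exactly what the comprehensive analysis above needs to answer.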
On Sat, Aug 3, 2019 at 7:56 PM vino yang <yanghua1...@gmail.com> wrote:

> +1 for both Beam and Flink
>
> > First step here is to probably draw out the current hierarchy and figure out
> > what the abstraction points are..
> > In my opinion, the runtime (spark, flink) should be done at the
> > hoodie-client level and just used by hoodie-utilities seamlessly..
>
> +1 for Vinoth's opinion, it should be the first step.
>
> No matter which computing framework we hope Hudi to integrate with,
> we need to decouple the Hudi client and Spark.
>
> We may need a pure client module named, for example,
> hoodie-client-core (common)
>
> Then we could have: hoodie-client-spark, hoodie-client-flink and
> hoodie-client-beam
>
> Suneel Marthi <smar...@apache.org> 于2019年8月4日周日 上午10:45写道:
>
> > +1 for Beam -- agree with Semantic Beeng's analysis.
> >
> > On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <taher...@gmail.com>
> > wrote:
> >
> > > So the way to go around this is to file a HIP, chalk out all the classes
> > > and start moving towards a pure client.
> > >
> > > Secondly, should we want to try Beam?
> > >
> > > I think there is too much going on here and I'm not able to follow. If we
> > > want to try out Beam all along, I don't think it makes sense to do anything
> > > on Flink then.
> > >
> > > On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <n...@semanticbeeng.com>
> > > wrote:
> > >
> > >> +1 My money is on this approach.
> > >>
> > >> The existing abstractions from Beam seem enough for the use cases as I
> > >> imagine them.
> > >>
> > >> Flink also has "dynamic table", "table source" and "table sink", which
> > >> seem very useful abstractions where Hudi might fit nicely.
> > >>
> > >> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html
> > >>
> > >> Attached a screenshot.
> > >>
> > >> This seems to fit with the original premise of Hudi as well.
> > >>
> > >> Am exploring this avenue with a use case that involves "temporal joins on
> > >> streams", which I need for feature extraction.
> > >>
> > >> Anyone who is interested in this or has concrete enough needs and use cases,
> > >> please let me know.
> > >>
> > >> Best to go from an agreed upon set of 2-3 use cases.
> > >>
> > >> Cheers
> > >>
> > >> Nick
> > >>
> > >> > Also, we do have some Beam experts on the mailing list.. Can you please
> > >> > weigh in on the viability of using Beam as the intermediate abstraction here
> > >> > between Spark/Flink? Hudi uses RDD APIs like groupBy, mapToPair, sortAndRepartition,
> > >> > reduceByKey, countByKey and also does custom partitioning a lot.
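To help the Beam folks weigh in on that last question: below is a rough, untested sketch (Beam Java SDK; the input data, keys and class name are made up for illustration) of how the simpler RDD-style operations we use could map onto PCollection transforms:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class BeamOpsSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Hypothetical record keys, stand-ins for whatever the write client shuffles today.
        PCollection<String> recordKeys =
            p.apply(Create.of("key1/part=2019", "key2/part=2019", "key1/part=2020"));

        // mapToPair -> MapElements into KV
        PCollection<KV<String, Long>> pairs =
            recordKeys.apply(
                MapElements.into(
                        TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.longs()))
                    .via(k -> KV.of(k.split("/")[0], 1L)));

        // reduceByKey -> Sum.longsPerKey() (Combine.perKey for a custom reducer)
        PCollection<KV<String, Long>> reduced = pairs.apply(Sum.longsPerKey());

        // groupBy -> GroupByKey
        PCollection<KV<String, Iterable<Long>>> grouped = pairs.apply(GroupByKey.create());

        // countByKey -> Count.perKey()
        PCollection<KV<String, Long>> counts = pairs.apply(Count.perKey());

        // (Results unused here; this only illustrates the operation mapping.)
        p.run().waitUntilFinish();
      }
    }

The easy mappings are not the worry, though. As far as I can tell, Beam does not expose Spark-style custom partitioners, sort-within-partition control, or explicit persist/caching to the user (the runner decides), which is exactly the (very big) assumption flagged under Approach 2 above.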