Decoupling Spark and Hudi is the first step to bringing in a Flink runtime, and it's also the hardest part.
On the decoupling itself, the IOHandle classes are (almost) unaware of Spark, whereas the Write/ReadClient and the Table classes are very aware of it. The first step here is probably to draw out the current hierarchy and figure out what the abstraction points are. In my opinion, the runtime (Spark, Flink) should be handled at the hoodie-client level and just used by hoodie-utilities seamlessly.

My 2c for folks working on this is to maybe pick up a few bugs/issues across these areas to get more familiar with the code, and then draw up the proposals. (Not a requirement, but it will build more understanding of all the devils in the details.)

>> Not sure if this requires a HIP to drive.

I think this definitely needs a HIP. It's a large enough change :)

Also, we do have some Beam experts on the mailing list. Can you please weigh in on the viability of using Beam as the intermediate abstraction here between Spark/Flink? Hudi uses RDD APIs like groupBy, mapToPair, sortAndRepartition, reduceByKey, countByKey and also does custom partitioning a lot.

On Fri, Aug 2, 2019 at 9:46 AM Aaron Langford <[email protected]> wrote:

> More for my own edification, how does the recently introduced timeline
> service play into the delta writer components?
>
> On Fri, Aug 2, 2019 at 7:53 AM vino yang <[email protected]> wrote:
>
> > Hi Suneel,
> >
> > Thank you for your suggestion, let me clarify.
> >
> > *The context of this email is that we are evaluating how to implement a
> > Stream Delta writer base on Flink.*
> > About the discussion between me, Taher and Vinay, those are just some
> > trivial details in the preparation of the document, and the discussion is
> > also based on mail.
> >
> > When we don't have the first draft, discussing the details on the mailing
> > list may confuse others and easily deviate from the topic. Our initial plan
> > was to facilitate community discussions and reviews when we had a draft of
> > the documentation available to the community.
> >
> > Best,
> > Vino
> >
> > Suneel Marthi <[email protected]> wrote on Fri, Aug 2, 2019 at 10:37 PM:
> >
> > > Please keep all discussions to Mailing lists here - no offline
> > > discussions please.
> > >
> > > On Fri, Aug 2, 2019 at 10:22 AM vino yang <[email protected]> wrote:
> > >
> > > > Hi guys,
> > > >
> > > > Currently, I, Taher and Vinay are working on issue HUDI-184.[1]
> > > >
> > > > As a first step, we are discussing the design doc.
> > > >
> > > > After diving into the code, We listed some relevant classes about the
> > > > Spark delta writer.
> > > >
> > > > - module: hoodie-utilities
> > > >
> > > > com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
> > > > com.uber.hoodie.utilities.deltastreamer.DeltaSyncService
> > > > com.uber.hoodie.utilities.deltastreamer.SourceFormatAdapter
> > > > com.uber.hoodie.utilities.schema.SchemaProvider
> > > > com.uber.hoodie.utilities.transform.Transformer
> > > >
> > > > - module: hoodie-client
> > > >
> > > > com.uber.hoodie.HoodieWriteClient (to commit compaction)
> > > >
> > > > The fact is *hoodie-utilities* depends on *hoodie-client*, however,
> > > > *hoodie-client* is also not a pure Hudi component, it also depends on
> > > > Spark lib.
> > > >
> > > > So I propose hoodie should provide a pure hoodie-client and decouple with
> > > > Spark. Then Flink and Spark modules should depend on it.
> > > >
> > > > Moreover, based on the old discussion[2], we all agree that Spark is not
> > > > the only choice for Hudi, it could also be Flink/Beam.
> > > >
> > > > IMO, We should decouple Hudi from Spark at the height of the project,
> > > > including but not limited to module splitting and renaming.
> > > >
> > > > Not sure if this requires a HIP to drive.
> > > >
> > > > We should first listen to the opinions of the community. Any ideas and
> > > > suggestions are welcome and appreciated.
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > > [1]: https://issues.apache.org/jira/browse/HUDI-184?filter=-1
> > > > [2]: https://lists.apache.org/api/source.lua/1533de2d4cd4243fa9e8f8bf057ffd02f2ac0bec7c7539d8f72166ea@%3Cdev.hudi.apache.org%3E
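
To make the idea in this thread of abstracting the runtime at the hoodie-client level a bit more concrete, here is a rough sketch of what an abstraction point might look like. Everything in it is hypothetical: `EngineCollection`, `LocalCollection` and `recordsPerPartition` are illustrative names, not existing Hudi classes, and the in-memory implementation only stands in for what a Spark-backed (wrapping an RDD) or Flink-backed implementation would do. It covers just a couple of the RDD-style operations mentioned above (countByKey, reduceByKey) and says nothing about custom partitioning, which is likely the harder part.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical engine-neutral view of a distributed collection.
// Client code would depend only on this interface, never on Spark or Flink types.
interface EngineCollection<T> {
    <R> EngineCollection<R> map(Function<T, R> fn);

    <K> Map<K, Long> countByKey(Function<T, K> keyFn);

    <K> Map<K, T> reduceByKey(Function<T, K> keyFn, BinaryOperator<T> reducer);

    List<T> collect();
}

// In-memory stand-in (an assumption for this sketch). A real engine module
// would provide its own implementation that delegates to the engine's APIs.
final class LocalCollection<T> implements EngineCollection<T> {
    private final List<T> items;

    LocalCollection(List<T> items) {
        this.items = new ArrayList<>(items);
    }

    @Override
    public <R> EngineCollection<R> map(Function<T, R> fn) {
        List<R> out = new ArrayList<>();
        for (T item : items) {
            out.add(fn.apply(item));
        }
        return new LocalCollection<>(out);
    }

    @Override
    public <K> Map<K, Long> countByKey(Function<T, K> keyFn) {
        Map<K, Long> counts = new LinkedHashMap<>();
        for (T item : items) {
            counts.merge(keyFn.apply(item), 1L, Long::sum);
        }
        return counts;
    }

    @Override
    public <K> Map<K, T> reduceByKey(Function<T, K> keyFn, BinaryOperator<T> reducer) {
        Map<K, T> reduced = new LinkedHashMap<>();
        for (T item : items) {
            reduced.merge(keyFn.apply(item), item, reducer);
        }
        return reduced;
    }

    @Override
    public List<T> collect() {
        return new ArrayList<>(items);
    }
}

class EngineAbstractionSketch {
    // Client-side logic written only against the interface: count records per
    // partition, where the partition is the part of the key before the first '/'.
    static Map<String, Long> recordsPerPartition(EngineCollection<String> recordKeys) {
        return recordKeys.countByKey(key -> key.substring(0, key.indexOf('/')));
    }

    public static void main(String[] args) {
        EngineCollection<String> keys =
            new LocalCollection<>(Arrays.asList("p1/a", "p1/b", "p2/c"));
        System.out.println(recordsPerPartition(keys)); // prints {p1=2, p2=1}
    }
}
```

The point of the sketch is only the dependency direction: `recordsPerPartition` compiles without any engine on the classpath, so a pure hoodie-client could hold such logic while separate Spark and Flink modules supply the implementations. Whether an interface this small can cover what the Write/ReadClient and Table classes actually need is exactly what drawing out the current hierarchy should answer.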
