Hello Semantic,

I think that refactoring the integration module in a plug-in style first requires extracting an abstract interface. As you said, a DSL is an appropriate way to integrate the flows of Spark and Flink, but a common interface is necessary for the low-level implementation. By the way, I will provide the gitlinks when I complete the Flink integration.
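As a starting point for the discussion, here is a minimal sketch of the kind of abstract interface such a refactoring could extract. All names here (`HoodieEngine`, `EngineData`, `LocalEngine`) are illustrative assumptions for this thread, not Hudi's actual API; the point is only that the write flow can be expressed against an engine-agnostic handle instead of an RDD.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class EngineAbstractionSketch {

    // Engine-agnostic handle over a distributed collection, hiding RDD/DataSet.
    interface EngineData<T> {
        <R> EngineData<R> map(Function<T, R> fn);
        List<T> collect();
    }

    // The common interface each engine plug-in (Spark, Flink, ...) would implement.
    interface HoodieEngine {
        <T> EngineData<T> parallelize(List<T> records);
    }

    // A trivial local engine, useful for tests and for showing that the
    // main flow compiles without any Spark or Flink types on the classpath.
    static class LocalEngine implements HoodieEngine {
        public <T> EngineData<T> parallelize(List<T> records) {
            return new EngineData<T>() {
                public <R> EngineData<R> map(Function<T, R> fn) {
                    return new LocalEngine().parallelize(
                        records.stream().map(fn).collect(Collectors.toList()));
                }
                public List<T> collect() {
                    return records;
                }
            };
        }
    }

    public static void main(String[] args) {
        HoodieEngine engine = new LocalEngine();
        List<Integer> doubled = engine.parallelize(List.of(1, 2, 3))
                                      .map(x -> x * 2)
                                      .collect();
        System.out.println(doubled); // [2, 4, 6]
    }
}
```

A Spark plug-in would then implement `HoodieEngine` by wrapping an RDD, and a Flink plug-in by wrapping a DataSet/DataStream, while the core write path depends only on the interface.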
Thanks,
Nicholas

At 2019-11-27 15:09:58, "Semantic Beeng" <[email protected]> wrote:

Hello Nicholas

--- Sorry, above I mistakenly used the other email! :-( ---

Indeed, I also expected that aligning the (quite different) programming models of the two frameworks would be quite difficult. I made some points and suggestions a while ago, if you care to search the mailing list archive (I mentioned Quill because it is a DSL over Spark).

We can also try functional programming abstractions like free monads and Kleisli to create a DSL for the generic flow of Hudi, with Spark/Flink-specific implementations. Please have a look; maybe you can get inspiration, and advise if you feel this might be worth talking more about (in Scala FP):

https://www.stephenzoio.com/creating-composable-data-pipelines-spark/
http://www.evernote.com/l/AYlyRBs4mG1DnblJzhkFJomfHKcScKN-0po/
http://www.evernote.com/l/AK96usP8vWJGPZIoMD4J_pphvo2R01B-Fa8/
http://www.evernote.com/l/AK8N_hFJStlB26xXOw4yHA1cOaQ7Nnb_Zfk/
https://softwaremill.com/free-monads/

Also, please give some gitlinks (see the IntelliJ plugin for this) to the code refactorings in progress. Ideally you'd create a wiki page to discuss this: trying to crack such a tough issue by email is futile (I think).

Hope it helps.

Cheers
Nick

On November 27, 2019 at 1:04 AM 蒋晓峰 <[email protected]> wrote:

Hi guys,

Feeling the pain of supporting the Flink engine for Hudi, it is necessary to discuss a high-cohesion, low-coupling, plug-in design for the calculation engine module here. In Hudi's current design, the Spark RDD API is interwoven with business logic scattered across multiple modules and many kinds of methods. As a result, developers with a computing-engine background have difficulty understanding the main flow of the Spark job, and plugging in another calculation engine is also difficult, because the general interfaces carry the context of RDD and Spark, unless a large-scale restructuring is undertaken.
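The Kleisli idea mentioned above can be illustrated in miniature, here in Java rather than the Scala FP of the linked articles, and greatly simplified: a pipeline step is an arrow `A -> Optional<B>` and steps compose via `flatMap`. The step names are hypothetical stand-ins; in Hudi terms they might be deduplication, index lookup, and write, expressed engine-agnostically.

```java
import java.util.Optional;
import java.util.function.Function;

public class KleisliSketch {

    // A Kleisli arrow over Optional: a pipeline step that may fail.
    interface Step<A, B> extends Function<A, Optional<B>> {
        // Kleisli composition: run this step, then feed its result to the next.
        default <C> Step<A, C> andThen(Step<B, C> next) {
            return a -> this.apply(a).flatMap(next);
        }
    }

    public static void main(String[] args) {
        // Hypothetical steps standing in for stages of a generic Hudi flow.
        Step<String, Integer> parse = s -> {
            try {
                return Optional.of(Integer.parseInt(s));
            } catch (NumberFormatException e) {
                return Optional.empty();
            }
        };
        Step<Integer, Integer> positiveOnly =
            n -> n > 0 ? Optional.of(n) : Optional.empty();

        // The generic flow is the composition; an engine-specific plug-in
        // would supply its own implementations of the individual steps.
        Step<String, Integer> pipeline = parse.andThen(positiveOnly);

        System.out.println(pipeline.apply("42"));   // Optional[42]
        System.out.println(pipeline.apply("-1"));   // Optional.empty
        System.out.println(pipeline.apply("oops")); // Optional.empty
    }
}
```

A free monad, as in the softwaremill article, goes one step further by reifying the steps as data so the same program can be interpreted by different engines; the composition shape is the same as this Kleisli sketch.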
In my opinion, it is necessary to refactor the Hudi integration module into plug-ins to facilitate the subsequent integration of Spark and Flink.

Best,
Nicholas
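For the plug-in mechanism itself, one conventional option on the JVM is `java.util.ServiceLoader`: each engine module ships a `META-INF/services` entry naming its implementation, and the core module discovers engines at runtime without compile-time dependencies on Spark or Flink. This is a hedged sketch of that option, not a description of how Hudi actually wires engines; `HoodieEngine` and `load` are hypothetical names.

```java
import java.util.ServiceLoader;

public class PluginLoaderSketch {

    // The service interface each engine plug-in module would implement
    // and register under META-INF/services/PluginLoaderSketch$HoodieEngine.
    public interface HoodieEngine {
        String name();
    }

    // Discover a registered engine by name; fails if no matching
    // plug-in is present on the classpath.
    public static HoodieEngine load(String wanted) {
        for (HoodieEngine engine : ServiceLoader.load(HoodieEngine.class)) {
            if (engine.name().equalsIgnoreCase(wanted)) {
                return engine;
            }
        }
        throw new IllegalArgumentException("No engine plug-in named " + wanted);
    }
}
```

With this shape, adding Flink support means adding a new module that implements the interface and registers itself, with no change to the core write path.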
