Hello Semantic,

I think that refactoring the integration module in a plug-in style first requires extracting an abstract interface. As you said, a DSL is an appropriate way to integrate the flows of Spark and Flink, but a common interface is necessary for the low-level implementation. By the way, I will provide the gitlinks when I complete the Flink integration.
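As a starting point for the discussion, here is a minimal sketch of the kind of abstract interface such a refactoring could extract. All names here (`HoodieEngine`, `EngineData`, `LocalEngine`) are illustrative assumptions for this thread, not Hudi's actual API; the point is only that the write flow can be expressed against an engine-agnostic handle instead of an RDD.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class EngineAbstractionSketch {

    // Engine-agnostic handle over a distributed collection, hiding RDD/DataSet.
    interface EngineData<T> {
        <R> EngineData<R> map(Function<T, R> fn);
        List<T> collect();
    }

    // The common interface each engine plug-in (Spark, Flink, ...) would implement.
    interface HoodieEngine {
        <T> EngineData<T> parallelize(List<T> records);
    }

    // A trivial local engine, useful for tests and for showing that the
    // main flow compiles without any Spark or Flink types on the classpath.
    static class LocalEngine implements HoodieEngine {
        public <T> EngineData<T> parallelize(List<T> records) {
            return new EngineData<T>() {
                public <R> EngineData<R> map(Function<T, R> fn) {
                    return new LocalEngine().parallelize(
                        records.stream().map(fn).collect(Collectors.toList()));
                }
                public List<T> collect() {
                    return records;
                }
            };
        }
    }

    public static void main(String[] args) {
        HoodieEngine engine = new LocalEngine();
        List<Integer> doubled = engine.parallelize(List.of(1, 2, 3))
                                      .map(x -> x * 2)
                                      .collect();
        System.out.println(doubled); // [2, 4, 6]
    }
}
```

A Spark plug-in would then implement `HoodieEngine` by wrapping an RDD, and a Flink plug-in by wrapping a DataSet/DataStream, while the core write path depends only on the interface.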
Thanks,
Nicholas

At 2019-11-27 15:09:58, "Semantic Beeng" <[email protected]> wrote:

Hello Nicholas

--- Sorry, above I mistakenly used the other email! :-( ---

Indeed, I also expected that aligning the (quite different) programming models of the two frameworks would be quite difficult. I made some points and suggestions a while ago, if you care to search the mailing list archive (I mentioned Quill because it is a DSL over Spark).

We can also try functional programming abstractions like free monads and Kleisli to create a DSL for the generic flow of Hudi, with Spark/Flink-specific implementations. Please have a look; maybe you can get inspiration, and advise if you feel this might be worth talking more about (in Scala FP):

https://www.stephenzoio.com/creating-composable-data-pipelines-spark/
http://www.evernote.com/l/AYlyRBs4mG1DnblJzhkFJomfHKcScKN-0po/
http://www.evernote.com/l/AK96usP8vWJGPZIoMD4J_pphvo2R01B-Fa8/
http://www.evernote.com/l/AK8N_hFJStlB26xXOw4yHA1cOaQ7Nnb_Zfk/
https://softwaremill.com/free-monads/

Also, please give some gitlinks (see the IntelliJ plugin for this) to the code refactorings in progress. Ideally you'd create a wiki page to discuss this: trying to crack such a tough issue by email is futile (I think).

Hope it helps.

Cheers
Nick

On November 27, 2019 at 1:04 AM 蒋晓峰 <[email protected]> wrote:

Hi guys,

Feeling the pain of supporting the Flink engine for Hudi, it is necessary to discuss a high-cohesion, low-coupling, plug-in design for the calculation engine module here. In Hudi's current design, the Spark RDD API is interwoven with business logic scattered across multiple modules and many kinds of methods. As a result, developers with a computing-engine background have difficulty understanding the main flow of the Spark job, and plugging in another calculation engine is also difficult, because the general interfaces carry the context of RDD and Spark, unless a large-scale restructuring is undertaken.
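The Kleisli idea mentioned above can be illustrated in miniature, here in Java rather than the Scala FP of the linked articles, and greatly simplified: a pipeline step is an arrow `A -> Optional<B>` and steps compose via `flatMap`. The step names are hypothetical stand-ins; in Hudi terms they might be deduplication, index lookup, and write, expressed engine-agnostically.

```java
import java.util.Optional;
import java.util.function.Function;

public class KleisliSketch {

    // A Kleisli arrow over Optional: a pipeline step that may fail.
    interface Step<A, B> extends Function<A, Optional<B>> {
        // Kleisli composition: run this step, then feed its result to the next.
        default <C> Step<A, C> andThen(Step<B, C> next) {
            return a -> this.apply(a).flatMap(next);
        }
    }

    public static void main(String[] args) {
        // Hypothetical steps standing in for stages of a generic Hudi flow.
        Step<String, Integer> parse = s -> {
            try {
                return Optional.of(Integer.parseInt(s));
            } catch (NumberFormatException e) {
                return Optional.empty();
            }
        };
        Step<Integer, Integer> positiveOnly =
            n -> n > 0 ? Optional.of(n) : Optional.empty();

        // The generic flow is the composition; an engine-specific plug-in
        // would supply its own implementations of the individual steps.
        Step<String, Integer> pipeline = parse.andThen(positiveOnly);

        System.out.println(pipeline.apply("42"));   // Optional[42]
        System.out.println(pipeline.apply("-1"));   // Optional.empty
        System.out.println(pipeline.apply("oops")); // Optional.empty
    }
}
```

A free monad, as in the softwaremill article, goes one step further by reifying the steps as data so the same program can be interpreted by different engines; the composition shape is the same as this Kleisli sketch.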
In my opinion, it is necessary to refactor the Hudi integration module into plug-ins to facilitate the subsequent integration of Spark and Flink.

Best,
Nicholas
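For the plug-in mechanism itself, one conventional option on the JVM is `java.util.ServiceLoader`: each engine module ships a `META-INF/services` entry naming its implementation, and the core module discovers engines at runtime without compile-time dependencies on Spark or Flink. This is a hedged sketch of that option, not a description of how Hudi actually wires engines; `HoodieEngine` and `load` are hypothetical names.

```java
import java.util.ServiceLoader;

public class PluginLoaderSketch {

    // The service interface each engine plug-in module would implement
    // and register under META-INF/services/PluginLoaderSketch$HoodieEngine.
    public interface HoodieEngine {
        String name();
    }

    // Discover a registered engine by name; fails if no matching
    // plug-in is present on the classpath.
    public static HoodieEngine load(String wanted) {
        for (HoodieEngine engine : ServiceLoader.load(HoodieEngine.class)) {
            if (engine.name().equalsIgnoreCase(wanted)) {
                return engine;
            }
        }
        throw new IllegalArgumentException("No engine plug-in named " + wanted);
    }
}
```

With this shape, adding Flink support means adding a new module that implements the interface and registers itself, with no change to the core write path.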
