[ https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018664#comment-17018664 ]
Vinoth Chandar edited comment on HUDI-538 at 1/18/20 6:05 PM:
--------------------------------------------------------------

+1 [~yanghua], I added a second task for moving classes around based on your changes. The core issue we need a solution for, IMO, is the following (if we solve this, the rest is more or less easy). I will illustrate using Spark, since my understanding of Flink is somewhat limited atm.

Even for Spark, I would like writing to work via either the _RDD_ or the _DataFrame_ route; today the code converts the DataFrame into RDDs to perform writes, which has some performance side-effects (surprisingly :P).

1) If you take a single class like _HoodieWriteClient_, it currently does something like `hoodieRecordRDD.map().sort()` internally. If we want to support a Flink DataStream or a Spark DataFrame as the object, we need to define an abstraction like `HoodieExecutionContext<T>` with a common set of map(T) -> T, sortBy(T) -> T, filter() and repartition() methods. There would be subclasses like _HoodieSparkRDDExecutionContext<JavaRDD>_, _HoodieSparkDataFrameExecutionContext<DataFrame>_ and _HoodieFlinkDataStreamExecutionContext<DataStream>_, each implementing them in engine-specific ways and handing back the transformed T object (see the first sketch below).

2) Right now we work with _HoodieRecord_ as the record-level abstraction, i.e. we eagerly parse the input into a HoodieKey (String recordKey, String partitionPath) and a HoodieRecordPayload. The key is needed during indexing, and the payload is needed to precombine duplicates within a batch (may be Spark specific) and to combine an incoming record with what's stored in the table during writing. We need a way to do these lazily, by pushing the key extraction function down through the entire write path (see the second sketch below).

I think we should think deeply about these issues and have concrete approaches before we embark further; we will hit these issues.
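On 1), a minimal sketch of what such an abstraction could look like. None of these types exist in Hudi today; `HoodieExecutionContext` and the subclass names are taken from the comment above, while `SerializableFunction`/`SerializablePredicate` and the exact signatures are invented for illustration:

{code:java}
// Hedged sketch only: HoodieExecutionContext does not exist in Hudi today;
// names and signatures below are illustrative, not a committed design.
import java.io.Serializable;
import java.util.function.Function;
import java.util.function.Predicate;

import org.apache.spark.api.java.JavaRDD;

// Function/predicate types the engines can serialize and ship to workers.
interface SerializableFunction<I, O> extends Function<I, O>, Serializable {}
interface SerializablePredicate<I> extends Predicate<I>, Serializable {}

/**
 * T is the engine-specific distributed collection (e.g. JavaRDD<HoodieRecord>,
 * Dataset<Row>, DataStream<HoodieRecord>); R is its element type. The write
 * client would code against this interface and never touch the engine directly.
 */
interface HoodieExecutionContext<T, R> {
  T map(T input, SerializableFunction<R, R> fn);
  T sortBy(T input, SerializableFunction<R, String> sortKey);
  T filter(T input, SerializablePredicate<R> predicate);
  T repartition(T input, int parallelism);
}

// One engine-specific implementation, for the Spark RDD route.
class HoodieSparkRDDExecutionContext<R>
    implements HoodieExecutionContext<JavaRDD<R>, R> {

  @Override
  public JavaRDD<R> map(JavaRDD<R> input, SerializableFunction<R, R> fn) {
    return input.map(fn::apply);
  }

  @Override
  public JavaRDD<R> sortBy(JavaRDD<R> input,
                           SerializableFunction<R, String> sortKey) {
    return input.sortBy(sortKey::apply, true, input.getNumPartitions());
  }

  @Override
  public JavaRDD<R> filter(JavaRDD<R> input,
                           SerializablePredicate<R> predicate) {
    return input.filter(predicate::test);
  }

  @Override
  public JavaRDD<R> repartition(JavaRDD<R> input, int parallelism) {
    return input.repartition(parallelism);
  }
}
{code}

A DataFrame or DataStream context would implement the same four methods on top of Dataset/DataStream operators, so _HoodieWriteClient_ could express `map().sort()` once, against the interface.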
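On 2), a hedged sketch of lazy key extraction. `KeyExtractor` and `LazyWriteHandle` are hypothetical names invented for illustration, and `HoodieKey` is simplified from the real class:

{code:java}
// Hedged sketch: instead of eagerly parsing every input record into a
// HoodieKey + HoodieRecordPayload at ingestion time, the write path carries
// a key-extraction function and invokes it only where the key is needed
// (index lookup, partition routing). KeyExtractor and LazyWriteHandle are
// hypothetical; HoodieKey is simplified from the real org.apache.hudi class.
import java.io.Serializable;
import java.util.function.Function;

final class HoodieKey {
  final String recordKey;
  final String partitionPath;
  HoodieKey(String recordKey, String partitionPath) {
    this.recordKey = recordKey;
    this.partitionPath = partitionPath;
  }
}

// Serializable so engines can ship it into the distributed write path.
interface KeyExtractor<I> extends Function<I, HoodieKey>, Serializable {}

// The client is handed raw input of type I plus the extractor, rather than
// a pre-materialized collection of HoodieRecord.
class LazyWriteHandle<I> {
  private final KeyExtractor<I> keyExtractor;

  LazyWriteHandle(KeyExtractor<I> keyExtractor) {
    this.keyExtractor = keyExtractor;
  }

  void write(I input) {
    // The key is derived here, at indexing/routing time, not at ingestion.
    HoodieKey key = keyExtractor.apply(input);
    // ... index lookup and file routing via key.recordKey / key.partitionPath ...
  }
}
{code}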
> Restructuring hudi client module for multi engine support
> ---------------------------------------------------------
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi (incubating)
> Issue Type: Wish
> Components: Code Cleanup
> Reporter: vinoyang
> Priority: Major
>
> Hudi is currently tightly coupled to the Spark framework, which makes
> integrating other computing engines difficult. We plan to decouple it from
> Spark; this umbrella issue tracks that work.
> Some thoughts are written up here:
> https://docs.google.com/document/d/1Q9w_4K6xzGbUrtTS0gAlzNYOmRXjzNUdbbe0q59PX9w/edit?usp=sharing
> The feature branch is {{restructure-hudi-client}}.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)