[ 
https://issues.apache.org/jira/browse/HUDI-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangminglei updated HUDI-1918:
-------------------------------
    Description: 
The code at 
([https://github.com/apache/hudi/blob/master/hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamer.java#L92])
 uses keyBy(HoodieRecord::getPartitionPath), which in an actual data 
warehouse scenario would cause serious data skew.

We can instead shuffle data by record key here to avoid multiple subtasks 
writing to a bucket at the same time, just like the pipeline in HoodieTableSink.
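
The skew can be illustrated with a small standalone sketch. It is not the 
Hudi or Flink code: Rec, subtaskFor, countPerSubtask and sampleRecords are 
hypothetical names, the partition value "2021-05-20" is made-up sample data, 
and routing is simplified to hash(key) modulo parallelism, whereas Flink 
actually assigns keys to subtasks through key groups. It only compares how 
records spread when keyed by partition path versus by record key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class KeyBySkewSketch {

    /** Hypothetical stand-in for HoodieRecord, reduced to the two fields used here. */
    public static final class Rec {
        public final String recordKey;
        public final String partitionPath;

        public Rec(String recordKey, String partitionPath) {
            this.recordKey = recordKey;
            this.partitionPath = partitionPath;
        }
    }

    /** Simplified routing (assumption): hash of the key modulo the parallelism. */
    public static int subtaskFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    /** Counts how many records each subtask would receive for a given key selector. */
    public static int[] countPerSubtask(List<Rec> records,
                                        Function<Rec, String> keySelector,
                                        int parallelism) {
        int[] counts = new int[parallelism];
        for (Rec r : records) {
            counts[subtaskFor(keySelector.apply(r), parallelism)]++;
        }
        return counts;
    }

    /** Builds n records with distinct keys that all fall into one date partition. */
    public static List<Rec> sampleRecords(int n) {
        List<Rec> records = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            records.add(new Rec("id-" + i, "2021-05-20"));
        }
        return records;
    }

    public static void main(String[] args) {
        List<Rec> records = sampleRecords(1000);
        int parallelism = 4;

        // Keyed by partition path: every record has the same key.
        System.out.println("by partition path: "
                + Arrays.toString(countPerSubtask(records, r -> r.partitionPath, parallelism)));

        // Keyed by record key: records spread over the subtasks.
        System.out.println("by record key:     "
                + Arrays.toString(countPerSubtask(records, r -> r.recordKey, parallelism)));
    }
}
```

With one date partition, keying by partition path sends every record to a 
single subtask, while keying by record key spreads them across all subtasks.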

  was:
The code at 
([https://github.com/apache/hudi/blob/master/hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamer.java#L92])
 uses keyBy(HoodieRecord::getPartitionPath). In an actual data warehouse 
scenario, the partition path is mostly based on log_date or log_hour, so 
this keyBy would cause serious data skew.

We can instead shuffle data by record key here to avoid multiple subtasks 
writing to a bucket at the same time, just like the pipeline in HoodieTableSink.


> Incorrect keyby field would cause serious data skew
> ---------------------------------------------------
>
>                 Key: HUDI-1918
>                 URL: https://issues.apache.org/jira/browse/HUDI-1918
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Flink Integration
>            Reporter: zhangminglei
>            Assignee: zhangminglei
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> The code at 
> ([https://github.com/apache/hudi/blob/master/hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamer.java#L92])
>  uses keyBy(HoodieRecord::getPartitionPath), which in an actual data 
> warehouse scenario would cause serious data skew.
> We can instead shuffle data by record key here to avoid multiple subtasks 
> writing to a bucket at the same time, just like the pipeline in HoodieTableSink.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
