[
https://issues.apache.org/jira/browse/HUDI-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zhangminglei updated HUDI-1918:
-------------------------------
Description:
The code at
([https://github.com/apache/hudi/blob/master/hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamer.java#L92])
keys the stream by partition path. In actual data warehouse scenarios, the
partition path is mostly based on log_date or log_hour, so
keyBy(HoodieRecord::getPartitionPath) can cause serious data skew.
We can shuffle data by record key here instead, to avoid multiple subtasks
writing to a bucket at the same time, just like the pipeline in HoodieTableSink.
was:
The code
([https://github.com/apache/hudi/blob/master/hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamer.java#L92])
that in the actual data warehouse, partition path is mostly based on log_date
or log_hour, so keyBy (HoodieRecord::getPartitionPath) that would cause
serious data skew in a way.
we can actually shuffle data by record key here to avoid multiple subtasks
write to a bucket at the same time, just like the pipeline in HoodieTableSink.
> Incorrect keyby field would cause serious data skew
> ---------------------------------------------------
>
> Key: HUDI-1918
> URL: https://issues.apache.org/jira/browse/HUDI-1918
> Project: Apache Hudi
> Issue Type: Bug
> Components: Flink Integration
> Reporter: zhangminglei
> Assignee: zhangminglei
> Priority: Critical
> Labels: pull-request-available
>
> The code at
> ([https://github.com/apache/hudi/blob/master/hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamer.java#L92])
> keys the stream by partition path. In actual data warehouse scenarios, the
> partition path is mostly based on log_date or log_hour, so
> keyBy(HoodieRecord::getPartitionPath) can cause serious data skew.
> We can shuffle data by record key here instead, to avoid multiple subtasks
> writing to a bucket at the same time, just like the pipeline in HoodieTableSink.
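To illustrate the skew described above, here is a minimal, self-contained sketch in plain Java. It is not the Hudi or Flink code: `Rec` is a hypothetical stand-in for HoodieRecord that carries only a record key and a partition path, and `subtaskFor` is a simplified "hash modulo parallelism" routing rule assumed for illustration (Flink's real keyBy routes keys through key groups, so exact counts would differ). The sample data assumes every record lands in a single log_date partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class KeyBySkewSketch {

    /** Hypothetical stand-in for HoodieRecord: only the two fields that matter here. */
    static final class Rec {
        private final String recordKey;
        private final String partitionPath;

        Rec(String recordKey, String partitionPath) {
            this.recordKey = recordKey;
            this.partitionPath = partitionPath;
        }

        String getRecordKey() { return recordKey; }
        String getPartitionPath() { return partitionPath; }
    }

    /** Simplified routing (an assumption): hash of the key modulo the parallelism. */
    static int subtaskFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    /** Counts how many records each subtask would receive for a given key selector. */
    static int[] distribute(List<Rec> records, Function<Rec, String> keySelector, int parallelism) {
        int[] counts = new int[parallelism];
        for (Rec r : records) {
            counts[subtaskFor(keySelector.apply(r), parallelism)]++;
        }
        return counts;
    }

    /** Builds n records with distinct record keys that all share one log_date partition. */
    static List<Rec> sampleRecords(int n) {
        List<Rec> records = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            records.add(new Rec("key-" + i, "2021-05-20"));
        }
        return records;
    }

    public static void main(String[] args) {
        List<Rec> records = sampleRecords(10_000);
        int parallelism = 4;

        // Keyed by partition path: every record carries the same key, so one subtask gets them all.
        System.out.println("by partition path: "
                + Arrays.toString(distribute(records, Rec::getPartitionPath, parallelism)));

        // Keyed by record key: the keys are distinct, so records spread across subtasks.
        System.out.println("by record key:     "
                + Arrays.toString(distribute(records, Rec::getRecordKey, parallelism)));
    }
}
```

With the partition path as the key, a single subtask receives every record while the others stay idle. With the record key, the load spreads across all subtasks, which is the effect the proposed change is after.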