zhangminglei created HUDI-1918:
----------------------------------

             Summary: Incorrect keyby field would cause serious data skew
                 Key: HUDI-1918
                 URL: https://issues.apache.org/jira/browse/HUDI-1918
             Project: Apache Hudi
          Issue Type: Bug
          Components: Flink Integration
            Reporter: zhangminglei
            Assignee: zhangminglei


In HoodieFlinkStreamer
(https://github.com/apache/hudi/blob/master/hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamer.java#L92),
the stream is keyed with keyBy(HoodieRecord::getPartitionPath). In a real data 
warehouse the partition path is usually based on log_date or log_hour, so at 
any given time most records share the same key and are routed to the same 
subtask, which causes serious data skew.

We can shuffle the data by record key here instead, just like the pipeline in 
HoodieTableSink does.
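
To illustrate the skew, here is a minimal, self-contained sketch that does not 
use Flink or Hudi. Rec is a hypothetical stand-in for HoodieRecord, and routing 
by hashCode modulo parallelism is a simplification of Flink's real key-group 
assignment, so treat the numbers as indicative only. It compares how records 
from a single log_date spread across 4 subtasks under each key choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class KeyBySkewSketch {

    /** Hypothetical stand-in for HoodieRecord, keeping only the two fields that matter here. */
    static final class Rec {
        final String recordKey;
        final String partitionPath;

        Rec(String recordKey, String partitionPath) {
            this.recordKey = recordKey;
            this.partitionPath = partitionPath;
        }
    }

    /** Builds n records with distinct record keys that all fall into one partition. */
    static List<Rec> sample(int n, String partitionPath) {
        List<Rec> records = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            records.add(new Rec("id-" + i, partitionPath));
        }
        return records;
    }

    /** Counts records per subtask when routing by hash(key) mod parallelism (a simplification). */
    static int[] distribute(List<Rec> records, Function<Rec, String> keySelector, int parallelism) {
        int[] counts = new int[parallelism];
        for (Rec r : records) {
            counts[Math.floorMod(keySelector.apply(r).hashCode(), parallelism)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // All records belong to the same log_date partition, as in a typical hourly/daily load.
        List<Rec> records = sample(10_000, "2021-05-20");
        System.out.println("by partition path: "
            + Arrays.toString(distribute(records, r -> r.partitionPath, 4)));
        System.out.println("by record key:     "
            + Arrays.toString(distribute(records, r -> r.recordKey, 4)));
    }
}
```

Keyed by partition path, every record lands on one subtask while the others 
stay idle; keyed by record key, the load is spread over all subtasks. In the 
streamer itself the change would presumably amount to keying with 
HoodieRecord::getRecordKey instead of HoodieRecord::getPartitionPath.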



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
