Alexey Kudinkin created HUDI-5685:
-------------------------------------

             Summary: Fix performance gap in Bulk Insert row-writing path with de-duplication enabled
                 Key: HUDI-5685
                 URL: https://issues.apache.org/jira/browse/HUDI-5685
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Alexey Kudinkin
            Assignee: Alexey Kudinkin
             Fix For: 0.13.0


Currently, when {{hoodie.combine.before.insert}} is set to true and 
{{hoodie.bulkinsert.sort.mode}} is set to {{NONE}}, Bulk Insert row-writing 
performance degrades considerably due to the following circumstances:
 * During de-duplication (within {{dedupRows}}), records in the incoming RDD 
are reshuffled (by Spark's default {{HashPartitioner}}) based on 
{{(partition-path, record-key)}} into N Spark partitions
 * Since {{BulkInsertSortMode.NONE}} is used as the partitioner, no 
re-partitioning is performed afterwards, and therefore each Spark task might 
end up writing into up to M table partitions
 * This in turn causes an explosion in the number of (small) files created, 
hurting both write performance and the table's layout
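The mechanics above can be modeled outside of Hudi/Spark. The sketch below (a simplified, hypothetical model, not actual Hudi code) mimics a hash shuffle on {{(partition-path, record-key)}} followed by a NONE-style write where each task opens one file per distinct table partition it sees, producing roughly N x M files instead of M:

```python
# Simplified model of the degraded path: hash-based de-duplication shuffle
# followed by BulkInsertSortMode.NONE (no re-partitioning before writing).

def hash_shuffle(records, num_spark_partitions):
    """Mimic Spark's default HashPartitioner keyed on (partition_path, record_key)."""
    buckets = [[] for _ in range(num_spark_partitions)]
    for partition_path, record_key in records:
        idx = hash((partition_path, record_key)) % num_spark_partitions
        buckets[idx].append((partition_path, record_key))
    return buckets

def files_written(buckets):
    """With no re-sorting, each task writes one file per distinct
    table partition present in its bucket."""
    return sum(len({p for p, _ in bucket}) for bucket in buckets)

# 10 table partitions, 1000 keys each, de-duped across 20 Spark tasks (N=20, M=10)
records = [(f"day={d}", f"key-{k}") for d in range(10) for k in range(1000)]
shuffled = hash_shuffle(records, num_spark_partitions=20)

# Each of the 20 tasks almost certainly sees all 10 table partitions,
# so the file count approaches N x M = 200 instead of the ideal M = 10.
print(files_written(shuffled))
```

Under a sorted or partitioned-by-table-partition layout, the same data would need only about one file per table partition; the hash shuffle destroys that locality.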



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
