Alexey Kudinkin created HUDI-5685:
-------------------------------------
Summary: Fix performance gap in Bulk Insert row-writing path with de-duplication enabled
Key: HUDI-5685
URL: https://issues.apache.org/jira/browse/HUDI-5685
Project: Apache Hudi
Issue Type: Bug
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
Fix For: 0.13.0
Currently, when the {{hoodie.combine.before.insert}} flag is set to true and
{{hoodie.bulkinsert.sort.mode}} is set to {{NONE}}, Bulk Insert row-writing
performance degrades considerably due to the following circumstances:
* During de-duplication (within {{dedupRows}}), records in the incoming RDD
are reshuffled (by Spark's default {{HashPartitioner}}) based on
{{(partition-path, record-key)}} into N shuffle partitions
* When {{BulkInsertSortMode.NONE}} is used as the partitioner, no
re-partitioning is performed, so each Spark task may end up
writing into M table partitions
* This in turn causes an explosion in the number of (small) files created,
hurting both write performance and the table's layout (see the sketch below)
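For illustration, here is a minimal, self-contained Scala sketch of that interaction. It is not Hudi's actual {{dedupRows}} implementation; the synthetic dataset, the {{reduceByKey}}-based de-duplication step, and all names in it are assumptions made purely for the example.

{code:scala}
import org.apache.spark.sql.SparkSession

object DedupSmallFilesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dedup-small-files-sketch")
      .master("local[4]")
      .getOrCreate()

    // Synthetic records keyed by (partition-path, record-key), spread over
    // M = 30 table partitions. (Illustrative data, not Hudi's record model.)
    val records = spark.sparkContext.parallelize(
      (1 to 100000).map(i => ((s"2023/01/${i % 30}", s"key-$i"), s"payload-$i")))

    // Step 1: de-duplication keyed by (partition-path, record-key).
    // reduceByKey shuffles with Spark's default HashPartitioner, scattering
    // rows of the same table partition across all N shuffle partitions.
    val deduped = records.reduceByKey((first, _) => first)

    // Step 2: with BulkInsertSortMode.NONE no re-partitioning or sorting
    // follows, so each Spark task sees rows from many table partitions and
    // would open one (small) file per table partition it touches.
    val tablePartitionsPerTask = deduped
      .mapPartitions(rows => Iterator(rows.map(_._1._1).toSet.size))
      .collect()

    println("Distinct table partitions touched per Spark task: " +
      tablePartitionsPerTask.mkString(", "))
    // Each task typically touches close to all 30 table partitions, implying
    // roughly N x M small files instead of ~M.

    spark.stop()
  }
}
{code}

Under these assumptions, each of the N write tasks touches up to M table partitions, so the write produces up to N x M small files where a partition-aware layout would produce roughly M.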