Jing Zhang created HUDI-7111:
--------------------------------
Summary: Performance regression of Spark jobs that write into a
simple bucket index table
Key: HUDI-7111
URL: https://issues.apache.org/jira/browse/HUDI-7111
Project: Apache Hudi
Issue Type: Improvement
Components: spark
Reporter: Jing Zhang
Attachments: image-2023-11-16-23-41-32-729.png
After upgrading to version 0.14.0, the performance of a Spark job that writes
into a simple bucket index table has regressed.
!image-2023-11-16-23-41-32-729.png!
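The cause is [PR#4480|https://github.com/apache/hudi/pull/4480]: the bucket
index refactor introduced two unnecessary stages for the simple bucket index.
The stages come from the following call, which runs a distinct (a shuffle)
followed by a collect to the driver: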
{code:java}
List<String> partitions =
records.map(HoodieRecord::getPartitionPath).distinct().collectAsList();
{code}
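As a rough illustration of why those stages look avoidable, here is a minimal plain-Java sketch. It assumes that a simple bucket index derives the bucket from the record key hash and a fixed bucket count, so each record can be tagged on its own without first collecting the distinct partition paths. The names SimpleBucketSketch, Rec, bucketId and tagLocations are hypothetical and are not Hudi's API.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleBucketSketch {
    // Hypothetical stand-in for a record: partition path plus record key.
    public static final class Rec {
        final String partitionPath;
        final String recordKey;

        public Rec(String partitionPath, String recordKey) {
            this.partitionPath = partitionPath;
            this.recordKey = recordKey;
        }
    }

    // Assumption: the bucket depends only on the key hash and a fixed bucket count.
    // Masking with Integer.MAX_VALUE keeps the result non-negative.
    public static int bucketId(String recordKey, int numBuckets) {
        return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    // Tags each record in a single pass; no global list of partitions is needed.
    public static List<String> tagLocations(List<Rec> records, int numBuckets) {
        List<String> locations = new ArrayList<>(records.size());
        for (Rec r : records) {
            locations.add(r.partitionPath + "/bucket-" + bucketId(r.recordKey, numBuckets));
        }
        return locations;
    }
}
```

Under that assumption the tagging is a per-record map with no shuffle, which is why a distinct-and-collect over partition paths would add work without changing the result for this index type.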