[ https://issues.apache.org/jira/browse/HUDI-7111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jing Zhang updated HUDI-7111:
-----------------------------
Description:
After upgrading to version 0.14.0, the performance of a Spark job that writes
to a simple bucket index table regressed.
!image-2023-11-16-23-41-32-729.png!
The cause is [PR#4480|https://github.com/apache/hudi/pull/4480]: the refactor
of the bucket index introduced two unnecessary stages in the tagging step for
the simple bucket index.
{code:java}
List<String> partitions =
records.map(HoodieRecord::getPartitionPath).distinct().collectAsList();
{code}
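One way to see why the up-front partition listing is avoidable is the sketch below. It is illustrative only and is not Hudi's code or the actual fix: the class {{LazyBucketTagSketch}}, the {{loadBucketToFileId}} callback, the hash scheme, and the "new" placeholder are all hypothetical. It assumes the bucket id depends only on the record key, so the bucket-to-file mapping of a partition can be loaded the first time that partition is seen, in the same pass that tags the records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not Hudi's implementation.
class LazyBucketTagSketch {

    // Assumed hashing scheme, for illustration only:
    // non-negative hash of the record key modulo the bucket count.
    static int bucketIdFor(String recordKey, int numBuckets) {
        return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    // Tags {recordKey, partitionPath} pairs in a single pass. The mapping of a
    // partition is loaded on first use and cached, so no separate
    // distinct()/collect pass over partition paths is needed beforehand.
    // The loader is assumed to return a non-null map.
    static List<String> tag(List<String[]> records, int numBuckets,
                            Function<String, Map<Integer, String>> loadBucketToFileId) {
        Map<String, Map<Integer, String>> cache = new HashMap<>();
        List<String> locations = new ArrayList<>();
        for (String[] rec : records) {
            String recordKey = rec[0];
            String partitionPath = rec[1];
            Map<Integer, String> bucketToFileId =
                cache.computeIfAbsent(partitionPath, loadBucketToFileId);
            int bucketId = bucketIdFor(recordKey, numBuckets);
            // "new" is a placeholder meaning no file group exists yet for this bucket.
            String fileId = bucketToFileId.getOrDefault(bucketId, "new");
            locations.add(partitionPath + "/bucket-" + bucketId + "/" + fileId);
        }
        return locations;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
            new String[] {"key-1", "2023/11/16"},
            new String[] {"key-2", "2023/11/17"});
        // The loader here returns an empty mapping for every partition.
        for (String location : tag(records, 8, partition -> new HashMap<>())) {
            System.out.println(location);
        }
    }
}
```

In a Spark job the same idea would presumably run inside each task over that task's records, which is why it would not add stages of its own; that placement is an assumption of this sketch and not something stated in the issue.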
was:
After upgrading to version 0.14.0, the performance of a Spark job that writes
to a simple bucket index table regressed.
!image-2023-11-16-23-41-32-729.png!
The cause is [PR#4480|https://github.com/apache/hudi/pull/4480]: the refactor
of the bucket index introduced two unnecessary stages for the simple bucket
index.
{code:java}
List<String> partitions =
records.map(HoodieRecord::getPartitionPath).distinct().collectAsList();
{code}
> Performance regression of a Spark job that writes to a simple bucket index
> table
> --------------------------------------------------------------------------------
>
> Key: HUDI-7111
> URL: https://issues.apache.org/jira/browse/HUDI-7111
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark
> Reporter: Jing Zhang
> Priority: Major
> Attachments: image-2023-11-16-23-41-32-729.png
>
>
> After upgrading to version 0.14.0, the performance of a Spark job that writes
> to a simple bucket index table regressed.
> !image-2023-11-16-23-41-32-729.png!
> The cause is [PR#4480|https://github.com/apache/hudi/pull/4480]: the refactor
> of the bucket index introduced two unnecessary stages in the tagging step for
> the simple bucket index.
> {code:java}
> List<String> partitions =
> records.map(HoodieRecord::getPartitionPath).distinct().collectAsList();
> {code}