[
https://issues.apache.org/jira/browse/HUDI-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
kwang updated HUDI-6990:
------------------------
Component/s: clustering
> Support controlling the read parallelism when a Spark clustering job reads records
> ----------------------------------------------------------------------------------
>
> Key: HUDI-6990
> URL: https://issues.apache.org/jira/browse/HUDI-6990
> Project: Apache Hudi
> Issue Type: Improvement
> Components: clustering
> Reporter: kwang
> Priority: Major
> Attachments: after-subtasks.png, before-subtasks.png
>
>
> When Spark executes a clustering job, it reads the clustering plan, which
> contains multiple groups, and each group processes many base files or log
> files. When the param `hoodie.clustering.plan.strategy.sort.columns` is
> configured, those files are read through Spark's parallelize method, and
> every file read generates one subtask. This is unreasonable; the read
> parallelism should be controllable.
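A minimal sketch of the issue, independent of Spark: parallelizing the file list with one slice per file yields one subtask per file, whereas a configurable cap would bound the task count. The function name `plan_read_tasks` and the `read_parallelism` parameter are hypothetical, used here only to illustrate the proposed behavior; the round-robin slicing mimics how Spark's `parallelize` distributes a collection across partitions.

```python
def plan_read_tasks(files, read_parallelism=None):
    """Group files into read partitions (hypothetical illustration).

    Without a cap, one partition (and hence one Spark subtask) is
    created per file; with a cap, files are spread over at most
    `read_parallelism` partitions.
    """
    if read_parallelism is None:
        num_tasks = len(files)  # current behavior: one subtask per file
    else:
        num_tasks = min(read_parallelism, len(files))
    # Round-robin the files into num_tasks slices, similar in spirit to
    # how Spark's parallelize slices a collection across partitions.
    return [files[i::num_tasks] for i in range(num_tasks)]

files = [f"file_{i}" for i in range(100)]
print(len(plan_read_tasks(files)))      # 100 subtasks, one per file
print(len(plan_read_tasks(files, 8)))   # capped at 8 subtasks
```

With a cap of 8, the same 100 files are still all read, just grouped into 8 subtasks instead of 100.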
--
This message was sent by Atlassian Jira
(v8.20.10#820010)