[
https://issues.apache.org/jira/browse/HUDI-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
kwang updated HUDI-6990:
------------------------
Description:
When Spark executes a clustering job, it reads a clustering plan that contains
multiple groups. Each group processes many base files or log files. When the
parameter `hoodie.clustering.plan.strategy.sort.columns` is configured, those
files are read through Spark's parallelize method, and every file read
generates its own subtask. This is unreasonable, because the read parallelism
cannot be controlled.
was: When Spark executes a clustering job, it reads a clustering plan that
contains multiple groups. Each group processes many base files or log files.
When those files are read through Spark's parallelize method, every file
generates its own subtask. This is unreasonable.
> Support controlling the parallelism when a Spark clustering job reads records
> -----------------------------------------------------------------------------
>
> Key: HUDI-6990
> URL: https://issues.apache.org/jira/browse/HUDI-6990
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: kwang
> Priority: Major
>
> When Spark executes a clustering job, it reads a clustering plan that
> contains multiple groups. Each group processes many base files or log files.
> When the parameter `hoodie.clustering.plan.strategy.sort.columns` is
> configured, those files are read through Spark's parallelize method, and
> every file read generates its own subtask. This is unreasonable, because the
> read parallelism cannot be controlled.
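
The behaviour described above can be illustrated with a minimal sketch in plain Python. It is not Hudi or Spark code: `slice_files` and `max_parallelism` are hypothetical names chosen for illustration, and the issue does not say how the fix is actually implemented. The sketch only shows the idea of capping the number of read tasks, by splitting the file list into a bounded number of contiguous slices in the way a call such as `parallelize(files, numSlices)` would, instead of creating one slice per file.

```python
from typing import List, Sequence


def slice_files(files: Sequence[str], max_parallelism: int) -> List[List[str]]:
    """Split `files` into at most `max_parallelism` contiguous slices.

    Hypothetical helper: each returned slice stands for one read task.
    """
    if max_parallelism < 1:
        raise ValueError("max_parallelism must be at least 1")
    n = len(files)
    if n == 0:
        return []
    # Never create more slices than there are files.
    num_slices = min(n, max_parallelism)
    # Integer arithmetic keeps slice sizes within one element of each other.
    return [
        list(files[i * n // num_slices:(i + 1) * n // num_slices])
        for i in range(num_slices)
    ]


if __name__ == "__main__":
    files = [f"file-{i}.parquet" for i in range(10)]
    # One task per file (the situation the issue complains about).
    print(len(slice_files(files, len(files))))
    # Bounded parallelism: the same files read by at most 4 tasks.
    print([len(s) for s in slice_files(files, 4)])
```

With a cap in place, the number of subtasks depends on the configured limit rather than on how many base or log files a clustering group happens to contain.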