ksmou commented on code in PR #9925:
URL: https://github.com/apache/hudi/pull/9925#discussion_r1374450011


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -161,6 +161,13 @@ public class HoodieClusteringConfig extends HoodieConfig {
          + "value will let the clustering job run faster, while it will give additional pressure to the "
          + "execution engines to manage more concurrent running jobs.");
 
+  public static final ConfigProperty<Integer> CLUSTERING_READ_RECORDS_PARALLELISM = ConfigProperty
+      .key("hoodie.clustering.read.records.parallelism")
+      .defaultValue(20)

Review Comment:
   > We already have a `hoodie.clustering.max.parallelism` to control how many clustering jobs to submit; this new param looks like it controls the parallelism when reading per group, and it only works when the row writer is disabled. This configuration name still confuses me.
   > 
   > Is there any other way to avoid a new configuration? What about directly using `clusteringGroup.getNumOutputFileGroups`?
   
   Yes, it limits the read parallelism for a single clustering group. `clusteringGroup.getNumOutputFileGroups` defaults to 2, which is too small for reading a 1 GB file.
   
   Maybe we can use `hoodie.clustering.rdd.read.parallelism` instead.
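   For context, the behavior under discussion boils down to the usual `ConfigProperty` pattern: a typed key with a default (here 20) that callers fall back to when the user has not set the key explicitly. A minimal self-contained sketch of that resolution logic (a simplified stand-in, not Hudi's actual `ConfigProperty` class; class and method names here are hypothetical):
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   public class ConfigSketch {
     // Simplified stand-in for Hudi's ConfigProperty: a key plus a typed default.
     static final class ConfigProperty<T> {
       final String key;
       final T defaultValue;
       ConfigProperty(String key, T defaultValue) {
         this.key = key;
         this.defaultValue = defaultValue;
       }
     }
   
     static final ConfigProperty<Integer> READ_RECORDS_PARALLELISM =
         new ConfigProperty<>("hoodie.clustering.read.records.parallelism", 20);
   
     // Resolve the effective parallelism: an explicit setting wins, else the default.
     static int resolve(Map<String, String> props, ConfigProperty<Integer> cfg) {
       String v = props.get(cfg.key);
       return v == null ? cfg.defaultValue : Integer.parseInt(v);
     }
   
     public static void main(String[] args) {
       Map<String, String> props = new HashMap<>();
       System.out.println(resolve(props, READ_RECORDS_PARALLELISM)); // unset: prints 20
       props.put("hoodie.clustering.read.records.parallelism", "100");
       System.out.println(resolve(props, READ_RECORDS_PARALLELISM)); // explicit: prints 100
     }
   }
   ```
   
   The naming question in the thread is orthogonal to this mechanism; whatever key is chosen, the read parallelism per clustering group would resolve the same way.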
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]