Re: [PR] feature(table-services): Allow users to not parallelize each partition with engine context during clustering planning [hudi]

via GitHub Thu, 12 Feb 2026 22:39:27 -0800


yihua commented on code in PR #18191:
URL: https://github.com/apache/hudi/pull/18191#discussion_r2802518116



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -347,6 +347,18 @@ public class HoodieClusteringConfig extends HoodieConfig {
           + "Please exercise caution while setting this config, especially 
when clustering is done very frequently. This could lead to race condition in "
           + "rare scenarios, for example, when the clustering completes after 
instants are fetched but before rollback completed.");
 
+  public static final ConfigProperty<Boolean> 
COMPUTE_PLAN_PER_PARTITION_PARALLEL = ConfigProperty
+      .key("hoodie.clustering.plan.compute.partition.parallel")
+      .defaultValue(true)
+      .sinceVersion("1.1.0")

Review Comment:
   ```suggestion
         .sinceVersion("1.2.0")
   ```



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -347,6 +347,18 @@ public class HoodieClusteringConfig extends HoodieConfig {
           + "Please exercise caution while setting this config, especially 
when clustering is done very frequently. This could lead to race condition in "
           + "rare scenarios, for example, when the clustering completes after 
instants are fetched but before rollback completed.");
 
+  public static final ConfigProperty<Boolean> 
COMPUTE_PLAN_PER_PARTITION_PARALLEL = ConfigProperty
+      .key("hoodie.clustering.plan.compute.partition.parallel")
+      .defaultValue(true)
+      .sinceVersion("1.1.0")
+      .withDocumentation("Compute clustering groups for each partition 
independently in parallel (using engine context) when generating "
+          + "clustering plan. For example, with Spark engine setting this to 
true would lead to a separate spark task computing all clustering "
+          + "groups per dataset partition, whereas setting this to false would 
lead to all clustering groups for all partitions being serially "
+          + "computed on the driver. By default this should be true, but can 
be disabled for cases where there are guaranteed to only be "
+          + "a few partitions with many files in clustering plan. And (when 
using Spark) it would be more resource-efficient to just use local " 
+          + "engine contexthave the spark driver (with extra memory) compute 
it all rather than allocate more memory per executor "

Review Comment:
   nit: looks like there's a typo/run-on in the doc string, `"engine 
contexthave the spark driver"` should probably be `"engine context to have the 
spark driver"` or similar.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -347,6 +347,18 @@ public class HoodieClusteringConfig extends HoodieConfig {
           + "Please exercise caution while setting this config, especially 
when clustering is done very frequently. This could lead to race condition in "
           + "rare scenarios, for example, when the clustering completes after 
instants are fetched but before rollback completed.");
 
+  public static final ConfigProperty<Boolean> 
COMPUTE_PLAN_PER_PARTITION_PARALLEL = ConfigProperty
+      .key("hoodie.clustering.plan.compute.partition.parallel")

Review Comment:
   Simplify with `hoodie.clustering.plan.partition.parallel` and 
`PLAN_PARTITION_PARALLEL`?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -1919,6 +1915,10 @@ public boolean isBinaryCopySchemaEvolutionEnabled() {
     return 
getBooleanOrDefault(HoodieClusteringConfig.FILE_STITCHING_BINARY_COPY_SCHEMA_EVOLUTION_ENABLE);
   }
 
+  public boolean isClusteringComputePlanPerPartitionParallel() {

Review Comment:
   nit: rename accordingly



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -347,6 +347,18 @@ public class HoodieClusteringConfig extends HoodieConfig {
           + "Please exercise caution while setting this config, especially 
when clustering is done very frequently. This could lead to race condition in "
           + "rare scenarios, for example, when the clustering completes after 
instants are fetched but before rollback completed.");
 
+  public static final ConfigProperty<Boolean> 
COMPUTE_PLAN_PER_PARTITION_PARALLEL = ConfigProperty
+      .key("hoodie.clustering.plan.compute.partition.parallel")
+      .defaultValue(true)
+      .sinceVersion("1.1.0")
+      .withDocumentation("Compute clustering groups for each partition 
independently in parallel (using engine context) when generating "
+          + "clustering plan. For example, with Spark engine setting this to 
true would lead to a separate spark task computing all clustering "

Review Comment:
   nit: the doc says "serially computed on the driver", but 
`HoodieLocalEngineContext.map()` uses `data.stream().parallel()`, so it's still 
parallel, just local (no Spark tasks). Might be worth clarifying to avoid 
confusion.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -630,6 +642,11 @@ public Builder 
withFileStitchingBinaryCopySchemaEvolutionEnabled(Boolean enabled
       return this;
     }
 
+    public Builder clusteringComputePlanPerPartitionParallel(Boolean parallel) 
{
+      clusteringConfig.setValue(COMPUTE_PLAN_PER_PARTITION_PARALLEL, 
String.valueOf(parallel));
+      return this;

Review Comment:
   ```suggestion
       public Builder clusteringPlanPartitionParallel(Boolean isParallel) {
         clusteringConfig.setValue(COMPUTE_PLAN_PER_PARTITION_PARALLEL, 
String.valueOf(isParallel));
         return this;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feature(table-services): Allow users to not parallelize each partition with engine context during clustering planning [hudi]

Reply via email to