nsivabalan commented on code in PR #18191:
URL: https://github.com/apache/hudi/pull/18191#discussion_r2830354187
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/client/functional/TestHoodieClientOnCopyOnWriteStorage.java:
##########
@@ -1033,6 +1024,21 @@ public void testBinaryCopyClustering() throws Exception {
false, SqlQueryEqualityPreCommitValidator.class.getName(),
COUNT_SQL_QUERY_FOR_VALIDATION, "");
}
+  /**
+   * Tests clustering when {@code hoodie.clustering.plan.partition.parallel} is disabled (non-default).
+   * Ensures the code path that computes clustering groups on the driver serially (HoodieLocalEngineContext) is exercised.
+   */
+  @ParameterizedTest
+  @MethodSource("populateMetaFieldsParams")
+  public void testClusteringWithComputePlanPerPartitionParallelDisabled(boolean populateMetaFields) throws Exception {
+    initMetaClient(getPropertiesForKeyGen(populateMetaFields));
+    HoodieClusteringConfig clusteringConfig = createClusteringBuilder(true, 1)
+        .planPartitionParallel(false)
+        .build();
+    testInsertAndClustering(clusteringConfig, populateMetaFields, true,
Review Comment:
We need to validate that the local engine context is used and not the Spark engine context. @suryaprasanna wrote a test in one of his recent patches that asserts this; maybe you can take inspiration from that patch.
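A rough sketch of the kind of assertion meant here (`capturingStrategy` / `getCapturedEngineContext()` are hypothetical test-only hooks for illustration, not existing Hudi helpers and not necessarily what his patch does):
```java
// Sketch: a test-only subclass of PartitionAwareClusteringPlanStrategy could record
// which engine context is used while clustering groups are computed, and the test
// then asserts it is the local context rather than the Spark one.
HoodieEngineContext usedContext = capturingStrategy.getCapturedEngineContext();
assertTrue(usedContext instanceof HoodieLocalEngineContext,
    "plan generation should stay on the driver via HoodieLocalEngineContext");
assertFalse(usedContext instanceof HoodieSparkEngineContext,
    "plan generation should not fan out through HoodieSparkEngineContext");
```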
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -347,6 +347,18 @@ public class HoodieClusteringConfig extends HoodieConfig {
+ "Please exercise caution while setting this config, especially
when clustering is done very frequently. This could lead to race condition in "
+ "rare scenarios, for example, when the clustering completes after
instants are fetched but before rollback completed.");
+ public static final ConfigProperty<Boolean> PLAN_PARTITION_PARALLEL =
ConfigProperty
Review Comment:
Let's try to find a better name for this. How about
`hoodie.clustering.plan.generation.use.local.engine.context`?
CC @yihua: can you think of a better name here?
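For reference, a rough sketch of how the renamed property could be declared (default flipped to match the inverted name; documentation text is just a placeholder):
```java
public static final ConfigProperty<Boolean> PLAN_GENERATION_USE_LOCAL_ENGINE_CONTEXT = ConfigProperty
    .key("hoodie.clustering.plan.generation.use.local.engine.context")
    .defaultValue(false) // parallel planning stays the default; local context only when opted in
    .withDocumentation("When enabled, clustering plan generation computes clustering groups "
        + "serially on the driver using the local engine context instead of the distributed one.");
```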
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/client/functional/TestHoodieClientOnCopyOnWriteStorage.java:
##########
@@ -1033,6 +1024,21 @@ public void testBinaryCopyClustering() throws Exception {
false, SqlQueryEqualityPreCommitValidator.class.getName(),
COUNT_SQL_QUERY_FOR_VALIDATION, "");
}
+  /**
+   * Tests clustering when {@code hoodie.clustering.plan.partition.parallel} is disabled (non-default).
+   * Ensures the code path that computes clustering groups on the driver serially (HoodieLocalEngineContext) is exercised.
+   */
+  @ParameterizedTest
+  @MethodSource("populateMetaFieldsParams")
+  public void testClusteringWithComputePlanPerPartitionParallelDisabled(boolean populateMetaFields) throws Exception {
Review Comment:
We don't need the parameterized test; we can just go with the OOB behavior (populate meta fields = true).
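i.e. roughly this shape (sketch only, keeping the existing helpers as-is):
```java
@Test
public void testClusteringWithComputePlanPerPartitionParallelDisabled() throws Exception {
  // hard-code the OOB behavior: populate meta fields = true
  initMetaClient(getPropertiesForKeyGen(true));
  HoodieClusteringConfig clusteringConfig = createClusteringBuilder(true, 1)
      .planPartitionParallel(false)
      .build();
  // rest of the body stays the same, with `populateMetaFields` replaced by `true`
}
```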
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/PartitionAwareClusteringPlanStrategy.java:
##########
@@ -164,7 +165,14 @@ public Option<HoodieClusteringPlan> generateClusteringPlan(ClusteringPlanActionE
      return Option.empty();
    }
-    List<Pair<List<HoodieClusteringGroup>, String>> res = getEngineContext().map(partitionPaths, partitionPath -> {
+    final HoodieEngineContext engineContext;
+    if (getWriteConfig().isClusteringPlanPartitionParallel()) {
+      engineContext = getEngineContext();
+    } else {
+      engineContext = new HoodieLocalEngineContext(getEngineContext().getStorageConf());
Review Comment:
Wondering if we can automatically do this for non-partitioned tables as well.
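e.g. something along these lines (sketch; exact condition is up for discussion):
```java
// Sketch: also stay on the driver when there is effectively nothing to parallelize,
// e.g. a non-partitioned table where partitionPaths has a single entry.
final HoodieEngineContext engineContext;
if (getWriteConfig().isClusteringPlanPartitionParallel() && partitionPaths.size() > 1) {
  engineContext = getEngineContext();
} else {
  engineContext = new HoodieLocalEngineContext(getEngineContext().getStorageConf());
}
```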