yihua commented on code in PR #18191:
URL: https://github.com/apache/hudi/pull/18191#discussion_r2802518116
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -347,6 +347,18 @@ public class HoodieClusteringConfig extends HoodieConfig {
+ "Please exercise caution while setting this config, especially
when clustering is done very frequently. This could lead to race condition in "
+ "rare scenarios, for example, when the clustering completes after
instants are fetched but before rollback completed.");
+ public static final ConfigProperty<Boolean>
COMPUTE_PLAN_PER_PARTITION_PARALLEL = ConfigProperty
+ .key("hoodie.clustering.plan.compute.partition.parallel")
+ .defaultValue(true)
+ .sinceVersion("1.1.0")
Review Comment:
```suggestion
.sinceVersion("1.2.0")
```
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -347,6 +347,18 @@ public class HoodieClusteringConfig extends HoodieConfig {
+ "Please exercise caution while setting this config, especially
when clustering is done very frequently. This could lead to race condition in "
+ "rare scenarios, for example, when the clustering completes after
instants are fetched but before rollback completed.");
+ public static final ConfigProperty<Boolean>
COMPUTE_PLAN_PER_PARTITION_PARALLEL = ConfigProperty
+ .key("hoodie.clustering.plan.compute.partition.parallel")
+ .defaultValue(true)
+ .sinceVersion("1.1.0")
+ .withDocumentation("Compute clustering groups for each partition
independently in parallel (using engine context) when generating "
+ + "clustering plan. For example, with Spark engine setting this to
true would lead to a separate spark task computing all clustering "
+ + "groups per dataset partition, whereas setting this to false would
lead to all clustering groups for all partitions being serially "
+ + "computed on the driver. By default this should be true, but can
be disabled for cases where there are guaranteed to only be "
+ + "a few partitions with many files in clustering plan. And (when
using Spark) it would be more resource-efficient to just use local "
+ + "engine contexthave the spark driver (with extra memory) compute
it all rather than allocate more memory per executor "
Review Comment:
nit: looks like there's a typo/run-on in the doc string, `"engine
contexthave the spark driver"` should probably be `"engine context to have the
spark driver"` or similar.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -347,6 +347,18 @@ public class HoodieClusteringConfig extends HoodieConfig {
+ "Please exercise caution while setting this config, especially
when clustering is done very frequently. This could lead to race condition in "
+ "rare scenarios, for example, when the clustering completes after
instants are fetched but before rollback completed.");
+ public static final ConfigProperty<Boolean>
COMPUTE_PLAN_PER_PARTITION_PARALLEL = ConfigProperty
+ .key("hoodie.clustering.plan.compute.partition.parallel")
Review Comment:
Simplify with `hoodie.clustering.plan.partition.parallel` and
`PLAN_PARTITION_PARALLEL`?
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -1919,6 +1915,10 @@ public boolean isBinaryCopySchemaEvolutionEnabled() {
return
getBooleanOrDefault(HoodieClusteringConfig.FILE_STITCHING_BINARY_COPY_SCHEMA_EVOLUTION_ENABLE);
}
+ public boolean isClusteringComputePlanPerPartitionParallel() {
Review Comment:
nit: rename accordingly
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -347,6 +347,18 @@ public class HoodieClusteringConfig extends HoodieConfig {
+ "Please exercise caution while setting this config, especially
when clustering is done very frequently. This could lead to race condition in "
+ "rare scenarios, for example, when the clustering completes after
instants are fetched but before rollback completed.");
+ public static final ConfigProperty<Boolean>
COMPUTE_PLAN_PER_PARTITION_PARALLEL = ConfigProperty
+ .key("hoodie.clustering.plan.compute.partition.parallel")
+ .defaultValue(true)
+ .sinceVersion("1.1.0")
+ .withDocumentation("Compute clustering groups for each partition
independently in parallel (using engine context) when generating "
+ + "clustering plan. For example, with Spark engine setting this to
true would lead to a separate spark task computing all clustering "
Review Comment:
nit: the doc says "serially computed on the driver", but
`HoodieLocalEngineContext.map()` uses `data.stream().parallel()`, so it's still
parallel, just local (no Spark tasks). Might be worth clarifying to avoid
confusion.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -630,6 +642,11 @@ public Builder
withFileStitchingBinaryCopySchemaEvolutionEnabled(Boolean enabled
return this;
}
+ public Builder clusteringComputePlanPerPartitionParallel(Boolean parallel)
{
+ clusteringConfig.setValue(COMPUTE_PLAN_PER_PARTITION_PARALLEL,
String.valueOf(parallel));
+ return this;
Review Comment:
```suggestion
public Builder clusteringPlanPartitionParallel(Boolean isParallel) {
clusteringConfig.setValue(COMPUTE_PLAN_PER_PARTITION_PARALLEL,
String.valueOf(isParallel));
return this;
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]