Re: [PR] docs: Update documentation for new features in Hudi 1.2.0 [hudi]

via GitHub Thu, 28 May 2026 08:37:08 -0700


yihua commented on code in PR #18867:
URL: https://github.com/apache/hudi/pull/18867#discussion_r3318939182



##########
website/docs/clustering.md:
##########
@@ -134,6 +134,47 @@ dynamically expanding the buckets for bucket index datasets.
 :::note The latter two strategies are applicable only for the Spark engine.
 :::
 
+#### CommitBasedClusteringPlanStrategy
+
+Hudi 1.2.0 introduced `org.apache.hudi.table.action.cluster.strategy.CommitBasedClusteringPlanStrategy`, a plan
+strategy that schedules clustering based on commit patterns rather than just file size. It groups file slices by the
+commits that produced them, making it easier to cluster data written in specific time windows or under specific commit
+criteria.
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clustering.plan.strategy.class` | `SparkSizeBasedClusteringPlanStrategy` | Set to `org.apache.hudi.table.action.cluster.strategy.CommitBasedClusteringPlanStrategy` to use commit-based planning. |
+| `hoodie.clustering.plan.strategy.earliest.commit.to.cluster` | (none) | Earliest commit time (exclusive) to start clustering from. Only commits after this instant are considered. Useful for incrementally clustering new data while skipping already-clustered history. |
+
+#### SparkStreamCopyClusteringPlanStrategy
+
+Available since Hudi 1.2.0, `org.apache.hudi.client.clustering.plan.strategy.SparkStreamCopyClusteringPlanStrategy`
+is a Spark-only plan strategy that performs binary file stitching (byte-level copy) rather than re-reading and
+re-writing records. This can be significantly faster when the goal is simply to coalesce small files and sort order is
+not required. It is paired with
+`org.apache.hudi.client.clustering.run.strategy.SparkStreamCopyClusteringExecutionStrategy`.
+
+#### Single-Group Clustering Control
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clustering.plan.strategy.single.group.clustering.enabled` | `true` | Whether to generate a clustering plan when only one file group is eligible. Set to `false` to skip clustering when there is nothing meaningful to consolidate (i.e., the partition already has a single file group). |
+
+#### File-Slice Sort Order in Plan Generation

Review Comment:
   fixed



