Re: [PR] docs: Update documentation for new features in Hudi 1.2.0 [hudi]

via GitHub Thu, 28 May 2026 00:03:26 -0700


nsivabalan commented on code in PR #18867:
URL: https://github.com/apache/hudi/pull/18867#discussion_r3315761518



##########
website/docs/clustering.md:
##########
@@ -134,6 +134,47 @@ dynamically expanding the buckets for bucket index datasets.
 :::note The latter two strategies are applicable only for the Spark engine.
 :::
 
+#### CommitBasedClusteringPlanStrategy
+
+Hudi 1.2.0 introduced `org.apache.hudi.table.action.cluster.strategy.CommitBasedClusteringPlanStrategy`, a plan
+strategy that schedules clustering based on commit patterns rather than just file size. It groups file slices by the
+commits that produced them, making it easier to cluster data written in specific time windows or under specific commit
+criteria.
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clustering.plan.strategy.class` | `SparkSizeBasedClusteringPlanStrategy` | Set to `org.apache.hudi.table.action.cluster.strategy.CommitBasedClusteringPlanStrategy` to use commit-based planning. |
+| `hoodie.clustering.plan.strategy.earliest.commit.to.cluster` | (none) | Earliest commit time (exclusive) to start clustering from. Only commits after this instant are considered. Useful for incrementally clustering new data while skipping already-clustered history. |
+
+#### SparkStreamCopyClusteringPlanStrategy
+
+Available since Hudi 1.2.0, `org.apache.hudi.client.clustering.plan.strategy.SparkStreamCopyClusteringPlanStrategy`
+is a Spark-only plan strategy that performs binary file stitching (byte-level copy) rather than re-reading and
+re-writing records. This can be significantly faster when the goal is simply to coalesce small files and sort order is
+not required. It is paired with
+`org.apache.hudi.client.clustering.run.strategy.SparkStreamCopyClusteringExecutionStrategy`.
+
+#### Single-Group Clustering Control
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clustering.plan.strategy.single.group.clustering.enabled` | `true` | Whether to generate a clustering plan when only one file group is eligible. Set to `false` to skip clustering when there is nothing meaningful to consolidate (i.e., the partition already has a single file group). |
+
+#### File-Slice Sort Order in Plan Generation

Review Comment:
   minor. 
   ```
   File-Slice Sort Order in Clustering Plan Generation
   ```
   



##########
website/docs/ingestion_flink.md:
##########
@@ -185,7 +194,7 @@ Hudi Flink writer supports two types of writer indexes:
 | Cross‑Partition Changes | Cannot handle changes among partitions (unless input is a CDC stream) | No limit on handling cross‑partition changes |
 
 :::note
-Bucket index supports only the `UPSERT` write operation and cannot be used with the [append mode](#append-mode) in Flink.
+Bucket index supports `UPSERT` write operations on both COW and MOR tables. As of Hudi 1.2.0, MOR + bucket index + upsert is fully supported. Bucket index cannot be used with the [append mode](#append-mode) in Flink.

Review Comment:
   hey @danny0405: is there anything we should call out in our release docs about this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
