nsivabalan commented on code in PR #18867:
URL: https://github.com/apache/hudi/pull/18867#discussion_r3315761518
##########
website/docs/clustering.md:
##########
@@ -134,6 +134,47 @@ dynamically expanding the buckets for bucket index
datasets.
:::note The latter two strategies are applicable only for the Spark engine.
:::
+#### CommitBasedClusteringPlanStrategy
+
+Hudi 1.2.0 introduced `org.apache.hudi.table.action.cluster.strategy.CommitBasedClusteringPlanStrategy`, a plan
+strategy that schedules clustering based on commit patterns rather than just file size. It groups file slices by the
+commits that produced them, making it easier to cluster data written in specific time windows or under specific commit
+criteria.
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clustering.plan.strategy.class` | `SparkSizeBasedClusteringPlanStrategy` | Set to `org.apache.hudi.table.action.cluster.strategy.CommitBasedClusteringPlanStrategy` to use commit-based planning. |
+| `hoodie.clustering.plan.strategy.earliest.commit.to.cluster` | (none) | Earliest commit time (exclusive) to start clustering from. Only commits after this instant are considered. Useful for incrementally clustering new data while skipping already-clustered history. |
+
+#### SparkStreamCopyClusteringPlanStrategy
+
+Available since Hudi 1.2.0, `org.apache.hudi.client.clustering.plan.strategy.SparkStreamCopyClusteringPlanStrategy`
+is a Spark-only plan strategy that performs binary file stitching (byte-level copy) rather than re-reading and
+re-writing records. This can be significantly faster when the goal is simply to coalesce small files and sort order is
+not required. It is paired with
+`org.apache.hudi.client.clustering.run.strategy.SparkStreamCopyClusteringExecutionStrategy`.
+
+#### Single-Group Clustering Control
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clustering.plan.strategy.single.group.clustering.enabled` | `true` | Whether to generate a clustering plan when only one file group is eligible. Set to `false` to skip clustering when there is nothing meaningful to consolidate (i.e., the partition already has a single file group). |
+
+#### File-Slice Sort Order in Plan Generation
Review Comment:
Minor: consider renaming this heading to:
```
File-Slice Sort Order in Clustering Plan Generation
```
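
For readers following this thread, here is a minimal sketch of how the options documented in the hunk above might be assembled. The `hoodie.clustering.plan.strategy.*` keys and the class names are copied from the diff. The helper names and the sample instant time are hypothetical, and `hoodie.clustering.execution.strategy.class` does not appear in the diff; it is assumed here to be the key that selects the execution strategy.

```python
# Class names copied from the diff under review.
COMMIT_BASED_PLAN = (
    "org.apache.hudi.table.action.cluster.strategy."
    "CommitBasedClusteringPlanStrategy"
)
STREAM_COPY_PLAN = (
    "org.apache.hudi.client.clustering.plan.strategy."
    "SparkStreamCopyClusteringPlanStrategy"
)
STREAM_COPY_EXEC = (
    "org.apache.hudi.client.clustering.run.strategy."
    "SparkStreamCopyClusteringExecutionStrategy"
)


def commit_based_options(earliest_commit=None, single_group=True):
    """Hypothetical helper: options for commit-based clustering plans."""
    opts = {
        "hoodie.clustering.plan.strategy.class": COMMIT_BASED_PLAN,
        "hoodie.clustering.plan.strategy.single.group.clustering.enabled":
            str(single_group).lower(),
    }
    # The doc describes this bound as exclusive: only commits after this
    # instant are considered. It is left unset when no bound is given.
    if earliest_commit is not None:
        opts["hoodie.clustering.plan.strategy.earliest.commit.to.cluster"] = (
            earliest_commit
        )
    return opts


def stream_copy_options():
    """Hypothetical helper: pair the stream-copy plan and execution strategies."""
    return {
        "hoodie.clustering.plan.strategy.class": STREAM_COPY_PLAN,
        # Assumption: this key is not part of the diff above.
        "hoodie.clustering.execution.strategy.class": STREAM_COPY_EXEC,
    }


# The instant time below is an illustrative placeholder, not a real commit.
options = commit_based_options("20250101000000000", single_group=False)
```

With PySpark, a dictionary like this could be passed to a writer via `df.write.format("hudi").options(**options)`, alongside the table's usual write options.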
##########
website/docs/ingestion_flink.md:
##########
@@ -185,7 +194,7 @@ Hudi Flink writer supports two types of writer indexes:
| Cross‑Partition Changes | Cannot handle changes among partitions (unless input is a CDC stream) | No limit on handling cross‑partition changes |
:::note
-Bucket index supports only the `UPSERT` write operation and cannot be used with the [append mode](#append-mode) in Flink.
+Bucket index supports `UPSERT` write operations on both COW and MOR tables. As of Hudi 1.2.0, MOR + bucket index + upsert is fully supported. Bucket index cannot be used with the [append mode](#append-mode) in Flink.
Review Comment:
Hey @danny0405: is there anything we should call out in the release docs about this?