yihua commented on code in PR #18867: URL: https://github.com/apache/hudi/pull/18867#discussion_r3318939182
##########
website/docs/clustering.md:
##########
@@ -134,6 +134,47 @@ dynamically expanding the buckets for bucket index datasets.
 :::note
 The latter two strategies are applicable only for the Spark engine.
 :::
+#### CommitBasedClusteringPlanStrategy
+
+Hudi 1.2.0 introduced `org.apache.hudi.table.action.cluster.strategy.CommitBasedClusteringPlanStrategy`, a plan
+strategy that schedules clustering based on commit patterns rather than just file size. It groups file slices by the
+commits that produced them, making it easier to cluster data written in specific time windows or under specific commit
+criteria.
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clustering.plan.strategy.class` | `SparkSizeBasedClusteringPlanStrategy` | Set to `org.apache.hudi.table.action.cluster.strategy.CommitBasedClusteringPlanStrategy` to use commit-based planning. |
+| `hoodie.clustering.plan.strategy.earliest.commit.to.cluster` | (none) | Earliest commit time (exclusive) to start clustering from. Only commits after this instant are considered. Useful for incrementally clustering new data while skipping already-clustered history. |
+
+#### SparkStreamCopyClusteringPlanStrategy
+
+Available since Hudi 1.2.0, `org.apache.hudi.client.clustering.plan.strategy.SparkStreamCopyClusteringPlanStrategy`
+is a Spark-only plan strategy that performs binary file stitching (byte-level copy) rather than re-reading and
+re-writing records. This can be significantly faster when the goal is simply to coalesce small files and sort order is
+not required. It is paired with
+`org.apache.hudi.client.clustering.run.strategy.SparkStreamCopyClusteringExecutionStrategy`.
+
+#### Single-Group Clustering Control
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clustering.plan.strategy.single.group.clustering.enabled` | `true` | Whether to generate a clustering plan when only one file group is eligible. Set to `false` to skip clustering when there is nothing meaningful to consolidate (i.e., the partition already has a single file group). |
+
+#### File-Slice Sort Order in Plan Generation

Review Comment:
   fixed
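
For reference, the options documented in this hunk could be combined roughly as follows. This is a minimal sketch, not something taken from the PR: the config keys and the strategy class name are copied from the diff, while the helper `commit_based_clustering_options`, the instant value `20250101000000000`, and the choice to disable single-group clustering are hypothetical placeholders.

```python
# Sketch only: keys and the class name come from the docs diff above;
# the helper name and all values are illustrative assumptions.
COMMIT_BASED_STRATEGY = (
    "org.apache.hudi.table.action.cluster.strategy."
    "CommitBasedClusteringPlanStrategy"
)


def commit_based_clustering_options(earliest_commit=None, single_group=True):
    """Build a dict of Hudi options selecting commit-based clustering planning."""
    opts = {
        # Switch the plan strategy away from the documented default.
        "hoodie.clustering.plan.strategy.class": COMMIT_BASED_STRATEGY,
        # "true"/"false" as strings, since writer options are passed as strings.
        "hoodie.clustering.plan.strategy.single.group.clustering.enabled":
            str(single_group).lower(),
    }
    if earliest_commit is not None:
        # Exclusive lower bound: only commits after this instant are considered.
        opts["hoodie.clustering.plan.strategy.earliest.commit.to.cluster"] = (
            earliest_commit
        )
    return opts


# Placeholder instant; substitute a real commit time from the table's timeline.
options = commit_based_clustering_options(
    earliest_commit="20250101000000000", single_group=False
)
```

With PySpark, a dict like this would typically be passed alongside the other Hudi write options, for example via `.options(**options)` on a `DataFrameWriter`.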
