codope commented on code in PR #7985:
URL: https://github.com/apache/hudi/pull/7985#discussion_r1112016144

##########
website/docs/clustering.md:
##########
@@ -91,62 +91,111 @@ update strategy.
 
 ### Plan Strategy
 
-This strategy comes into play while creating clustering plan. It helps to decide what file groups should be clustered.
-Let's look at different plan strategies that are available with Hudi. Note that these strategies are easily pluggable
-using this [config](/docs/configurations#hoodieclusteringplanstrategyclass).
-
-1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
-   the [small file limit](/docs/configurations/#hoodieclusteringplanstrategysmallfilelimit)
-   of base files and creates clustering groups upto max file size allowed per group. The max size can be specified using
-   this [config](/docs/configurations/#hoodieclusteringplanstrategymaxbytespergroup). This
-   strategy is useful for stitching together medium-sized files into larger ones to reduce lot of files spread across
-   cold partitions.
-2. `SparkRecentDaysClusteringPlanStrategy`: It looks back previous 'N' days partitions and creates a plan that will
-   cluster the 'small' file slices within those partitions. This is the default strategy. It could be useful when the
-   workload is predictable and data is partitioned by time.
-3. `SparkSelectedPartitionsClusteringPlanStrategy`: In case you want to cluster only specific partitions within a range,
-   no matter how old or new are those partitions, then this strategy could be useful. To use this strategy, one needs
-   to set below two configs additionally (both begin and end partitions are inclusive):
+This strategy comes into play while creating clustering plan. It helps to decide what file groups should be clustered
+and how many output file groups should the clustering produce. Note that these strategies are easily pluggable using the
+config [hoodie.clustering.plan.strategy.class](/docs/configurations#hoodieclusteringplanstrategyclass).
 
-```
-hoodie.clustering.plan.strategy.cluster.begin.partition
-hoodie.clustering.plan.strategy.cluster.end.partition
-```
+Different plan strategies are as follows:
+
+#### Size-based clustering strategies
+
+This strategy creates clustering groups based on max size allowed per group. Also, it excludes files that are greater
+than the small file limit from the clustering plan. Available strategies depending on write client
+are: `SparkSizeBasedClusteringPlanStrategy`, `FlinkSizeBasedClusteringPlanStrategy`
+and `JavaSizeBasedClusteringPlanStrategy`. Furthermore, Hudi provides flexibility to include or exclude partitions for
+clustering, tune the file size limits, maximum number of output groups, as we will see below.
+
+`hoodie.clustering.plan.strategy.partition.selected`: Comma separated list of partitions to be considered for
+clustering.
+
+`hoodie.clustering.plan.strategy.partition.regex.pattern`: Filters clustering partitions that matched regex pattern.
+
+`hoodie.clustering.plan.partition.filter.mode`: In addition to previous filtering, we have few additional filtering as
+well. Different values for this mode are `NONE`, `RECENT_DAYS` and `SELECTED_PARTITIONS`.

Review Comment:
   Sorry, I should have looked at the enum. I have added it now. Thanks.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
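
For readers following the thread, the partition-filtering options named in the diff above could be assembled roughly as follows. This is only a minimal sketch, not the PR author's code: the helper `clustering_plan_options`, the example partition paths, and the fully qualified package of `SparkSizeBasedClusteringPlanStrategy` are assumptions for illustration and should be checked against the Hudi version in use. The config keys and the three filter mode values are taken from the diff.

```python
# Allowed values for hoodie.clustering.plan.partition.filter.mode,
# as listed in the diff under review.
FILTER_MODES = {"NONE", "RECENT_DAYS", "SELECTED_PARTITIONS"}


def clustering_plan_options(partitions=None, regex=None, filter_mode="NONE"):
    """Build a dict of clustering plan options (hypothetical helper)."""
    if filter_mode not in FILTER_MODES:
        raise ValueError(f"unknown filter mode: {filter_mode}")
    opts = {
        # Assumption: the package prefix below is not stated in the diff;
        # verify the fully qualified class name for your Hudi version.
        "hoodie.clustering.plan.strategy.class": (
            "org.apache.hudi.client.clustering.plan.strategy."
            "SparkSizeBasedClusteringPlanStrategy"
        ),
        "hoodie.clustering.plan.partition.filter.mode": filter_mode,
    }
    if partitions:
        # The diff describes this value as a comma separated list of partitions.
        opts["hoodie.clustering.plan.strategy.partition.selected"] = ",".join(partitions)
    if regex:
        opts["hoodie.clustering.plan.strategy.partition.regex.pattern"] = regex
    return opts


# Example partition paths are made up for illustration.
options = clustering_plan_options(partitions=["2023/02/01", "2023/02/02"])
```

The resulting dict could then be passed as writer options (for example via `.options(**options)` on a Spark DataFrame writer), alongside the usual table and write configs that are out of scope here.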
