[GitHub] [hudi] codope commented on a diff in pull request #7985: [DOCS] Update clustering docs

via GitHub Mon, 20 Feb 2023 06:23:32 -0800


codope commented on code in PR #7985:
URL: https://github.com/apache/hudi/pull/7985#discussion_r1112016144



##########
website/docs/clustering.md:
##########
@@ -91,62 +91,111 @@ update strategy.
 
 ### Plan Strategy
 
-This strategy comes into play while creating clustering plan. It helps to decide what file groups should be clustered.
-Let's look at different plan strategies that are available with Hudi. Note that these strategies are easily pluggable
-using this [config](/docs/configurations#hoodieclusteringplanstrategyclass).
-
-1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
-   the [small file limit](/docs/configurations/#hoodieclusteringplanstrategysmallfilelimit)
-   of base files and creates clustering groups upto max file size allowed per group. The max size can be specified using
-   this [config](/docs/configurations/#hoodieclusteringplanstrategymaxbytespergroup). This
-   strategy is useful for stitching together medium-sized files into larger ones to reduce lot of files spread across
-   cold partitions.
-2. `SparkRecentDaysClusteringPlanStrategy`: It looks back previous 'N' days partitions and creates a plan that will
-   cluster the 'small' file slices within those partitions. This is the default strategy. It could be useful when the
-   workload is predictable and data is partitioned by time.
-3. `SparkSelectedPartitionsClusteringPlanStrategy`: In case you want to cluster only specific partitions within a range,
-   no matter how old or new are those partitions, then this strategy could be useful. To use this strategy, one needs
-   to set below two configs additionally (both begin and end partitions are inclusive):
+This strategy comes into play while creating clustering plan. It helps to decide what file groups should be clustered
+and how many output file groups should the clustering produce. Note that these strategies are easily pluggable using the
+config [hoodie.clustering.plan.strategy.class](/docs/configurations#hoodieclusteringplanstrategyclass).
 
-```
-hoodie.clustering.plan.strategy.cluster.begin.partition
-hoodie.clustering.plan.strategy.cluster.end.partition
-```
+Different plan strategies are as follows:
+
+#### Size-based clustering strategies
+
+This strategy creates clustering groups based on max size allowed per group. Also, it excludes files that are greater
+than the small file limit from the clustering plan. Available strategies depending on write client
+are: `SparkSizeBasedClusteringPlanStrategy`, `FlinkSizeBasedClusteringPlanStrategy`
+and `JavaSizeBasedClusteringPlanStrategy`. Furthermore, Hudi provides flexibility to include or exclude partitions for
+clustering, tune the file size limits, maximum number of output groups, as we will see below.
+
+`hoodie.clustering.plan.strategy.partition.selected`: Comma separated list of partitions to be considered for
+clustering.
+
+`hoodie.clustering.plan.strategy.partition.regex.pattern`: Filters clustering partitions that matched regex pattern.
+
+`hoodie.clustering.plan.partition.filter.mode`: In addition to previous filtering, we have few additional filtering as
+well. Different values for this mode are `NONE`, `RECENT_DAYS` and `SELECTED_PARTITIONS`.

Review Comment:
   Sorry, I should have looked at the enum. I have added it now. Thanks.
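
For readers following the hunk above, here is a rough sketch of how the partition-related settings might be passed as writer options. The config keys and the three filter-mode values are taken from the diff; the partition paths are made-up examples, the fully qualified class name is an assumption about where `SparkSizeBasedClusteringPlanStrategy` lives, and `check_filter_mode` is a hypothetical helper, not part of Hudi.

```python
# Filter-mode values as listed in the doc text under review.
VALID_FILTER_MODES = {"NONE", "RECENT_DAYS", "SELECTED_PARTITIONS"}

# Hypothetical option map; keys come from the diff, values are illustrative.
clustering_options = {
    # Assumed package path for the Spark size-based plan strategy.
    "hoodie.clustering.plan.strategy.class":
        "org.apache.hudi.client.clustering.plan.strategy."
        "SparkSizeBasedClusteringPlanStrategy",
    # Comma-separated partitions to consider (made-up paths).
    "hoodie.clustering.plan.strategy.partition.selected": "2023/02/18,2023/02/19",
    # No additional partition filtering on top of the list above.
    "hoodie.clustering.plan.partition.filter.mode": "NONE",
}


def check_filter_mode(options):
    """Return the configured filter mode, rejecting values not in the doc's list."""
    mode = options["hoodie.clustering.plan.partition.filter.mode"]
    if mode not in VALID_FILTER_MODES:
        raise ValueError(f"unsupported partition filter mode: {mode}")
    return mode


print(check_filter_mode(clustering_options))
```

A map like this would typically be handed to the writer as a set of string options; validating the mode up front only guards against typos in the value and does not reflect how Hudi itself parses the setting.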



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

