This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new a0dc7e93b5e [DOCS] Update clustering page with inline configs (#9429)
a0dc7e93b5e is described below
commit a0dc7e93b5eb7c0cab5bbd7b74dfd06230fc4dec
Author: Bhavani Sudha Saktheeswaran <[email protected]>
AuthorDate: Fri Aug 11 10:42:44 2023 -0700
[DOCS] Update clustering page with inline configs (#9429)
---
website/docs/clustering.md | 65 ++++++++++----------------------------
website/src/theme/DocPage/index.js | 2 +-
2 files changed, 18 insertions(+), 49 deletions(-)
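For context, the clustering options consolidated into tables by this change are ordinary Hudi write configs that can be supplied inline. A rough, illustrative sketch (the partition list is a made-up example and the numeric values are placeholders, not recommendations; consult the configuration reference linked in the docs for actual defaults):

```properties
# Illustrative inline clustering configs -- keys are from the page this commit
# updates; values here are examples only.
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=4
hoodie.clustering.plan.strategy.partition.selected=2023/08/10,2023/08/11
hoodie.clustering.plan.strategy.small.file.limit=314572800
hoodie.clustering.plan.strategy.max.num.groups=30
hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
```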
diff --git a/website/docs/clustering.md b/website/docs/clustering.md
index 8eb0dfbfaa1..2fb0f17c25e 100644
--- a/website/docs/clustering.md
+++ b/website/docs/clustering.md
@@ -103,50 +103,16 @@ This strategy creates clustering groups based on max size allowed per group. Als
than the small file limit from the clustering plan. Available strategies depending on write client
are: `SparkSizeBasedClusteringPlanStrategy`, `FlinkSizeBasedClusteringPlanStrategy`
and `JavaSizeBasedClusteringPlanStrategy`. Furthermore, Hudi provides flexibility to include or exclude partitions for
-clustering, tune the file size limits, maximum number of output groups, as we will see below.
+clustering, tune the file size limits and maximum number of output groups. Please refer to [hoodie.clustering.plan.strategy.small.file.limit](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit),
+[hoodie.clustering.plan.strategy.max.num.groups](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategymaxnumgroups), [hoodie.clustering.plan.strategy.max.bytes.per.group](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategymaxbytespergroup),
+[hoodie.clustering.plan.strategy.target.file.max.bytes](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategytargetfilemaxbytes) for more details.
-`hoodie.clustering.plan.strategy.partition.selected`: Comma separated list of partitions to be considered for
-clustering.
+| Config Name | Default | Description |
+|---------------------------------------------------------|--------------------|-------------|
+| hoodie.clustering.plan.strategy.partition.selected | N/A **(Required)** | Comma separated list of partitions to run clustering<br /><br />`Config Param: PARTITION_SELECTED`<br />`Since Version: 0.11.0` |
+| hoodie.clustering.plan.strategy.partition.regex.pattern | N/A **(Required)** | Filter clustering partitions that matched regex pattern<br /><br />`Config Param: PARTITION_REGEX_PATTERN`<br />`Since Version: 0.11.0` |
+| hoodie.clustering.plan.partition.filter.mode | NONE (Optional) | Partition filter mode used in the creation of clustering plan. Possible values:<br /><ul><li>`NONE`: Do not filter partitions. The clustering plan will include all partitions that have clustering candidates.</li><li>`RECENT_DAYS`: This filter assumes that your data is partitioned by date. The clustering plan will only include partitions from K days ago to N days ago, where K >= N. K is determined by `hood [...]
-`hoodie.clustering.plan.strategy.partition.regex.pattern`: Filters clustering partitions that matched regex pattern.
-
-`hoodie.clustering.plan.partition.filter.mode`: In addition to previous filtering, we have few additional filtering as
-well. Different values for this mode are `NONE`, `RECENT_DAYS` and `SELECTED_PARTITIONS`.
-
-- `NONE`: do not filter table partition and thus the clustering plan will include all partitions that have clustering
-  candidate.
-- `RECENT_DAYS`: keep a continuous range of partitions, works together with the below configs:
-  - `hoodie.clustering.plan.strategy.daybased.lookback.partitions`: Number of partitions to list to create
-    ClusteringPlan.
-  - `hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions`: Number of partitions to skip from latest when
-    choosing partitions to create ClusteringPlan. As the name implies, applicable only if partitioning is day based.
-- `SELECTED_PARTITIONS`: keep partitions that are in the specified range based on below configs:
-  - `hoodie.clustering.plan.strategy.cluster.begin.partition`: Begin partition used to filter partition (inclusive).
-  - `hoodie.clustering.plan.strategy.cluster.end.partition`: End partition used to filter partition (inclusive).
-- `DAY_ROLLING`: cluster partitions on a rolling basis by the hour to avoid clustering all partitions each time.
-
-**Small file limit**
-
-`hoodie.clustering.plan.strategy.small.file.limit`: Files smaller than the size in bytes specified here are candidates
-for clustering. Larges file groups will be ignored.
-
-**Max number of groups**
-
-`hoodie.clustering.plan.strategy.max.num.groups`: Maximum number of groups to create as part of ClusteringPlan.
-Increasing groups will increase parallelism. This does not imply the number of output file groups as such. This refers
-to clustering groups (parallel tasks/threads that will work towards producing output file groups). Total output file
-groups is also determined by based on target file size which we will discuss shortly.
-
-**Max bytes per group**
-
-`hoodie.clustering.plan.strategy.max.bytes.per.group`: Each clustering operation can create multiple output file groups.
-Total amount of data processed by clustering operation is defined by below two properties (Max bytes per group * Max num
-groups. Thus, this config will assist in capping the max amount of data to be included in one group.
-
-**Target file size max**
-
-`hoodie.clustering.plan.strategy.target.file.max.bytes`: Each group can produce “N” (max group size /target file size)
-output file groups.
#### SparkSingleFileSortPlanStrategy
@@ -171,6 +137,10 @@ config [hoodie.clustering.execution.strategy.class](/docs/configurations/#hoodie
default, Hudi sorts the file groups in the plan by the specified columns, while meeting the configured target file
sizes.
+| Config Name | Default | Description |
+| -------------------------------------------- | --------------------------------------------------------------------------------------------- | ------------- |
+| hoodie.clustering.execution.strategy.class | org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy (Optional) | Config to provide a strategy class (subclass of RunClusteringStrategy) to define how the clustering plan is executed. By default, we sort the file groups in the plan by the specified columns, while meeting the configured target file sizes.<br /><br />`Config Param: EXECUTION_STRATEGY_CLASS_NAME`<br />`Since Version: 0.7.0` |
+
The available strategies are as follows:
1. `SPARK_SORT_AND_SIZE_EXECUTION_STRATEGY`: Uses bulk_insert to re-write data from input file groups.
@@ -189,7 +159,7 @@ The available strategies are as follows:
### Update Strategy
Currently, clustering can only be scheduled for tables/partitions not receiving any concurrent updates. By default,
-the [config for update strategy](/docs/configurations/#hoodieclusteringupdatesstrategy) is set to ***
+the config for update strategy - [`hoodie.clustering.updates.strategy`](/docs/configurations/#hoodieclusteringupdatesstrategy) is set to ***
SparkRejectUpdateStrategy***. If some file group has updates during clustering then it will reject updates and throw an
exception. However, in some use-cases updates are very sparse and do not touch most file groups. The default strategy to
simply reject updates does not seem fair. In such use-cases, users can set the config to ***SparkAllowUpdateStrategy***.
@@ -237,11 +207,10 @@ Hudi supports [multi-writers](https://hudi.apache.org/docs/concurrency_control#e
snapshot isolation between multiple table services, thus allowing writers to continue with ingestion while clustering
runs in the background.
-| Config key | Remarks | Default |
-| ----------- | ------- | ------- |
-| `hoodie.clustering.async.enabled` | Enable running of clustering service, asynchronously as writes happen on the table. | False |
-| `hoodie.clustering.async.max.commits` | Control frequency of async clustering by specifying after how many commits clustering should be triggered. | 4 |
-| `hoodie.clustering.preserve.commit.metadata` | When rewriting data, preserves existing _hoodie_commit_time. This means users can run incremental queries on clustered data without any side-effects. | False |
+| Config Name | Default | Description |
+| ----------------------------------- | ---------------- | ------------- |
+| hoodie.clustering.async.enabled | false (Optional) | Enable running of clustering service, asynchronously as inserts happen on the table.<br /><br />`Config Param: ASYNC_CLUSTERING_ENABLE`<br />`Since Version: 0.7.0` |
+| hoodie.clustering.async.max.commits | 4 (Optional) | Config to control frequency of async clustering<br /><br />`Config Param: ASYNC_CLUSTERING_MAX_COMMITS`<br />`Since Version: 0.9.0` |
## Setup Asynchronous Clustering
Users can leverage [HoodieClusteringJob](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-SetupforAsyncclusteringJob)
diff --git a/website/src/theme/DocPage/index.js b/website/src/theme/DocPage/index.js
index 24b764ea044..00a31ceb295 100644
--- a/website/src/theme/DocPage/index.js
+++ b/website/src/theme/DocPage/index.js
@@ -128,7 +128,7 @@ function DocPageContent({
);
}
-const arrayOfPages = (matchPath) => [`${matchPath}/configurations`, `${matchPath}/basic_configurations`, `${matchPath}/timeline`, `${matchPath}/table_types`, `${matchPath}/migration_guide`, `${matchPath}/compaction`];
+const arrayOfPages = (matchPath) => [`${matchPath}/configurations`, `${matchPath}/basic_configurations`, `${matchPath}/timeline`, `${matchPath}/table_types`, `${matchPath}/migration_guide`, `${matchPath}/compaction`, `${matchPath}/clustering`];
const showCustomStylesForDocs = (matchPath, pathname) => arrayOfPages(matchPath).includes(pathname);
function DocPage(props) {
const {