This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new a0dc7e93b5e [DOCS] Update clustering page with inline configs (#9429)
a0dc7e93b5e is described below
commit a0dc7e93b5eb7c0cab5bbd7b74dfd06230fc4dec
Author: Bhavani Sudha Saktheeswaran <[email protected]>
AuthorDate: Fri Aug 11 10:42:44 2023 -0700
[DOCS] Update clustering page with inline configs (#9429)
---
website/docs/clustering.md | 65 ++++++++++----------------------------
website/src/theme/DocPage/index.js | 2 +-
2 files changed, 18 insertions(+), 49 deletions(-)
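For context, the clustering options consolidated into tables by this change are ordinary Hudi write configs that can be supplied inline. A rough, illustrative sketch (the partition list is a made-up example and the numeric values are placeholders, not recommendations; consult the configuration reference linked in the docs for actual defaults):

```properties
# Illustrative inline clustering configs -- keys are from the page this commit
# updates; values here are examples only.
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=4
hoodie.clustering.plan.strategy.partition.selected=2023/08/10,2023/08/11
hoodie.clustering.plan.strategy.small.file.limit=314572800
hoodie.clustering.plan.strategy.max.num.groups=30
hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
```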
diff --git a/website/docs/clustering.md b/website/docs/clustering.md
index 8eb0dfbfaa1..2fb0f17c25e 100644
--- a/website/docs/clustering.md
+++ b/website/docs/clustering.md
@@ -103,50 +103,16 @@ This strategy creates clustering groups based on max size allowed per group. Als
than the small file limit from the clustering plan. Available strategies depending on write client
are: `SparkSizeBasedClusteringPlanStrategy`, `FlinkSizeBasedClusteringPlanStrategy`
and `JavaSizeBasedClusteringPlanStrategy`. Furthermore, Hudi provides flexibility to include or exclude partitions for
-clustering, tune the file size limits, maximum number of output groups, as we will see below.
+clustering, tune the file size limits and maximum number of output groups. Please refer to [hoodie.clustering.plan.strategy.small.file.limit](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit),
+[hoodie.clustering.plan.strategy.max.num.groups](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategymaxnumgroups), [hoodie.clustering.plan.strategy.max.bytes.per.group](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategymaxbytespergroup),
+[hoodie.clustering.plan.strategy.target.file.max.bytes](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategytargetfilemaxbytes) for more details.
-`hoodie.clustering.plan.strategy.partition.selected`: Comma separated list of partitions to be considered for
-clustering.
+| Config Name | Default | Description |
+|---------------------------------------------------------|--------------------|-------------|
+| hoodie.clustering.plan.strategy.partition.selected | N/A **(Required)** | Comma separated list of partitions to run clustering<br /><br />`Config Param: PARTITION_SELECTED`<br />`Since Version: 0.11.0` |
+| hoodie.clustering.plan.strategy.partition.regex.pattern | N/A **(Required)** | Filter clustering partitions that matched regex pattern<br /><br />`Config Param: PARTITION_REGEX_PATTERN`<br />`Since Version: 0.11.0` |
+| hoodie.clustering.plan.partition.filter.mode | NONE (Optional) | Partition filter mode used in the creation of clustering plan. Possible values:<br /><ul><li>`NONE`: Do not filter partitions. The clustering plan will include all partitions that have clustering candidates.</li><li>`RECENT_DAYS`: This filter assumes that your data is partitioned by date. The clustering plan will only include partitions from K days ago to N days ago, where K >= N. K is determined by `hood [...]
-`hoodie.clustering.plan.strategy.partition.regex.pattern`: Filters clustering partitions that matched regex pattern.
-
-`hoodie.clustering.plan.partition.filter.mode`: In addition to previous filtering, we have few additional filtering as
-well. Different values for this mode are `NONE`, `RECENT_DAYS` and `SELECTED_PARTITIONS`.
-
-- `NONE`: do not filter table partition and thus the clustering plan will include all partitions that have clustering
-  candidate.
-- `RECENT_DAYS`: keep a continuous range of partitions, works together with the below configs:
-  - `hoodie.clustering.plan.strategy.daybased.lookback.partitions`: Number of partitions to list to create
-    ClusteringPlan.
-  - `hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions`: Number of partitions to skip from latest when
-    choosing partitions to create ClusteringPlan. As the name implies, applicable only if partitioning is day based.
-- `SELECTED_PARTITIONS`: keep partitions that are in the specified range based on below configs:
-  - `hoodie.clustering.plan.strategy.cluster.begin.partition`: Begin partition used to filter partition (inclusive).
-  - `hoodie.clustering.plan.strategy.cluster.end.partition`: End partition used to filter partition (inclusive).
-- `DAY_ROLLING`: cluster partitions on a rolling basis by the hour to avoid clustering all partitions each time.
-
-**Small file limit**
-
-`hoodie.clustering.plan.strategy.small.file.limit`: Files smaller than the size in bytes specified here are candidates
-for clustering. Larges file groups will be ignored.
-
-**Max number of groups**
-
-`hoodie.clustering.plan.strategy.max.num.groups`: Maximum number of groups to create as part of ClusteringPlan.
-Increasing groups will increase parallelism. This does not imply the number of output file groups as such. This refers
-to clustering groups (parallel tasks/threads that will work towards producing output file groups). Total output file
-groups is also determined by based on target file size which we will discuss shortly.
-
-**Max bytes per group**
-
-`hoodie.clustering.plan.strategy.max.bytes.per.group`: Each clustering operation can create multiple output file groups.
-Total amount of data processed by clustering operation is defined by below two properties (Max bytes per group * Max num
-groups. Thus, this config will assist in capping the max amount of data to be included in one group.
-
-**Target file size max**
-
-`hoodie.clustering.plan.strategy.target.file.max.bytes`: Each group can produce “N” (max group size /target file size)
-output file groups.
#### SparkSingleFileSortPlanStrategy
@@ -171,6 +137,10 @@ config [hoodie.clustering.execution.strategy.class](/docs/configurations/#hoodie
default, Hudi sorts the file groups in the plan by the specified columns, while meeting the configured target file
sizes.
+| Config Name | Default | Description |
+| -------------------------------------------- | --------------------------------------------------------------------------------------------- | ------------- |
+| hoodie.clustering.execution.strategy.class | org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy (Optional) | Config to provide a strategy class (subclass of RunClusteringStrategy) to define how the clustering plan is executed. By default, we sort the file groups in the plan by the specified columns, while meeting the configured target file sizes.<br /><br />`Config Param: EXECUTION_STRATEGY_CLASS_NAME`<br />`Since Version: 0.7.0` |
+
The available strategies are as follows:
1. `SPARK_SORT_AND_SIZE_EXECUTION_STRATEGY`: Uses bulk_insert to re-write data from input file groups.
@@ -189,7 +159,7 @@ The available strategies are as follows:
### Update Strategy
Currently, clustering can only be scheduled for tables/partitions not receiving any concurrent updates. By default,
-the [config for update strategy](/docs/configurations/#hoodieclusteringupdatesstrategy) is set to ***
+the config for update strategy - [`hoodie.clustering.updates.strategy`](/docs/configurations/#hoodieclusteringupdatesstrategy) is set to ***
SparkRejectUpdateStrategy***. If some file group has updates during clustering then it will reject updates and throw an
exception. However, in some use-cases updates are very sparse and do not touch most file groups. The default strategy to
simply reject updates does not seem fair. In such use-cases, users can set the config to ***SparkAllowUpdateStrategy***.
@@ -237,11 +207,10 @@ Hudi supports [multi-writers](https://hudi.apache.org/docs/concurrency_control#e
snapshot isolation between multiple table services, thus allowing writers to continue with ingestion while clustering
runs in the background.
-| Config key | Remarks | Default |
-| ----------- | ------- | ------- |
-| `hoodie.clustering.async.enabled` | Enable running of clustering service, asynchronously as writes happen on the table. | False |
-| `hoodie.clustering.async.max.commits` | Control frequency of async clustering by specifying after how many commits clustering should be triggered. | 4 |
-| `hoodie.clustering.preserve.commit.metadata` | When rewriting data, preserves existing _hoodie_commit_time. This means users can run incremental queries on clustered data without any side-effects. | False |
+| Config Name | Default | Description |
+| ----------------------------------- | ---------------- | ------------- |
+| hoodie.clustering.async.enabled | false (Optional) | Enable running of clustering service, asynchronously as inserts happen on the table.<br /><br />`Config Param: ASYNC_CLUSTERING_ENABLE`<br />`Since Version: 0.7.0` |
+| hoodie.clustering.async.max.commits | 4 (Optional) | Config to control frequency of async clustering<br /><br />`Config Param: ASYNC_CLUSTERING_MAX_COMMITS`<br />`Since Version: 0.9.0` |
## Setup Asynchronous Clustering
Users can leverage [HoodieClusteringJob](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-SetupforAsyncclusteringJob)
diff --git a/website/src/theme/DocPage/index.js b/website/src/theme/DocPage/index.js
index 24b764ea044..00a31ceb295 100644
--- a/website/src/theme/DocPage/index.js
+++ b/website/src/theme/DocPage/index.js
@@ -128,7 +128,7 @@ function DocPageContent({
);
}
-const arrayOfPages = (matchPath) => [`${matchPath}/configurations`, `${matchPath}/basic_configurations`, `${matchPath}/timeline`, `${matchPath}/table_types`, `${matchPath}/migration_guide`, `${matchPath}/compaction`];
+const arrayOfPages = (matchPath) => [`${matchPath}/configurations`, `${matchPath}/basic_configurations`, `${matchPath}/timeline`, `${matchPath}/table_types`, `${matchPath}/migration_guide`, `${matchPath}/compaction`, `${matchPath}/clustering`];
const showCustomStylesForDocs = (matchPath, pathname) => arrayOfPages(matchPath).includes(pathname);
function DocPage(props) {
const {