This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 4c8f222070 [HUDI-4339] Add example configuration for HoodieCleaner in 
docs (#6326)
4c8f222070 is described below

commit 4c8f2220700e738cbff44782c1476f98f55d8f3e
Author: Manu <[email protected]>
AuthorDate: Tue Aug 30 11:30:15 2022 +0800

    [HUDI-4339] Add example configuration for HoodieCleaner in docs (#6326)
    
    Co-authored-by: Y Ethan Guo <[email protected]>
---
 website/docs/hoodie_cleaner.md                     | 80 ++++++++++++++++++----
 .../version-0.11.1/hoodie_cleaner.md               | 63 +++++++++++++++--
 .../version-0.12.0/hoodie_cleaner.md               | 62 +++++++++++++++--
 3 files changed, 179 insertions(+), 26 deletions(-)

diff --git a/website/docs/hoodie_cleaner.md b/website/docs/hoodie_cleaner.md
index 10f1aa2450..1687a0e065 100644
--- a/website/docs/hoodie_cleaner.md
+++ b/website/docs/hoodie_cleaner.md
@@ -14,15 +14,22 @@ each commit, to delete older file slices. It's recommended 
to leave this enabled
 When cleaning old files, you should be careful not to remove files that are 
being actively used by long running queries.
 Hudi cleaner currently supports the below cleaning policies to keep a certain 
number of commits or file versions:
 
-- **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal 
cleaning policy that ensures the effect of 
-having lookback into all the changes that happened in the last X commits. 
Suppose a writer is ingesting data 
-into a Hudi dataset every 30 minutes and the longest running query can take 5 
hours to finish, then the user should 
-retain atleast the last 10 commits. With such a configuration, we ensure that 
the oldest version of a file is kept on 
-disk for at least 5 hours, thereby preventing the longest running query from 
failing at any point in time. Incremental cleaning is also possible using this 
policy.
-- **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N 
number of file versions irrespective of time. 
-This policy is useful when it is known how many MAX versions of the file does 
one want to keep at any given time. 
-To achieve the same behaviour as before of preventing long running queries 
from failing, one should do their calculations 
-based on data patterns. Alternatively, this policy is also useful if a user 
just wants to maintain 1 latest version of the file.
+- **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal 
cleaning policy that ensures the effect of
+  having lookback into all the changes that happened in the last X commits. 
Suppose a writer is ingesting data
+  into a Hudi dataset every 30 minutes and the longest running query can take 
5 hours to finish, then the user should
+  retain at least the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on
+  disk for at least 5 hours, thereby preventing the longest running query from 
failing at any point in time. Incremental cleaning is also possible using this 
policy.
+  Number of commits to retain can be configured by 
`hoodie.cleaner.commits.retained`.
+
+- **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N 
number of file versions irrespective of time.
+  This policy is useful when you know the maximum number of versions of the file you want to keep at any given time.
+  To achieve the same behaviour as before of preventing long running queries 
from failing, one should do their calculations
+  based on data patterns. Alternatively, this policy is also useful if a user 
just wants to maintain 1 latest version of the file.
+  Number of file versions to retain can be configured by 
`hoodie.cleaner.fileversions.retained`.
+
+- **KEEP_LATEST_BY_HOURS**: This policy cleans up based on hours. It is simple and useful when you want to retain files for a fixed window of time.
+  File versions corresponding to commits with commit times older than the configured number of hours are cleaned.
+  The number of hours to retain can be configured by `hoodie.cleaner.hours.retained`.
 
 ### Configurations
 For details about all possible configurations and their default values see the 
[configuration 
docs](https://hudi.apache.org/docs/configurations#Compaction-Configs).
@@ -32,12 +39,52 @@ Hoodie Cleaner can be run as a separate process or along 
with your data ingestio
 ingesting data, configs are available which enable you to run it 
[synchronously or 
asynchronously](https://hudi.apache.org/docs/configurations#hoodiecleanasync).
 
 You can use this command for running the cleaner independently:
-```java
-[hoodie]$ spark-submit --class org.apache.hudi.utilities.HoodieCleaner \
-  --props s3:///temp/hudi-ingestion-config/kafka-source.properties \
-  --target-base-path s3:///temp/hudi \
-  --spark-master yarn-cluster
 ```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner 
`ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
+        Usage: <main class> [options]
+        Options:
+        --help, -h
+
+        --hoodie-conf
+        Any configuration that can be set in the properties file (using the CLI
+        parameter "--props") can also be passed command line using this
+        parameter. This can be repeated
+        Default: []
+        --props
+        path to properties file on localfs or dfs, with configurations for
+        hoodie client for cleaning
+        --spark-master
+        spark master to use.
+        Default: local[2]
+        * --target-base-path
base path for the hoodie table to be cleaned.
+```
+Some examples of running the cleaner:
+Keep the latest 10 commits
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner 
`ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar`\
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
+  --hoodie-conf hoodie.cleaner.commits.retained=10 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Keep the latest 3 file versions
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner 
`ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar`\
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS \
+  --hoodie-conf hoodie.cleaner.fileversions.retained=3 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Clean commits older than 24 hours
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner 
`ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar`\
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS \
+  --hoodie-conf hoodie.cleaner.hours.retained=24 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Note: The parallelism used by the cleaner is the minimum of the number of partitions to clean and `hoodie.cleaner.parallelism`.
 
 ### Run Asynchronously
 In case you wish to run the cleaner service asynchronously with writing, 
please configure the below:
@@ -54,4 +101,9 @@ CLI provides the below commands for cleaner service:
 - `clean showpartitions`
 - `cleans run`
 
+Example of the cleaner keeping the latest 3 commits
+```
+cleans run --sparkMaster local --hoodieConfigs 
hoodie.cleaner.policy=KEEP_LATEST_COMMITS,hoodie.cleaner.commits.retained=3,hoodie.cleaner.parallelism=200
+```
+
 You can find more details and the relevant code for these commands in 
[`org.apache.hudi.cli.commands.CleansCommand`](https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CleansCommand.java)
 class. 
diff --git a/website/versioned_docs/version-0.11.1/hoodie_cleaner.md 
b/website/versioned_docs/version-0.11.1/hoodie_cleaner.md
index 34c1cf11d1..00aa7709f4 100644
--- a/website/versioned_docs/version-0.11.1/hoodie_cleaner.md
+++ b/website/versioned_docs/version-0.11.1/hoodie_cleaner.md
@@ -18,14 +18,18 @@ Hudi cleaner currently supports the below cleaning policies 
to keep a certain nu
 having lookback into all the changes that happened in the last X commits. 
Suppose a writer is ingesting data 
 into a Hudi dataset every 30 minutes and the longest running query can take 5 
hours to finish, then the user should 
 retain atleast the last 10 commits. With such a configuration, we ensure that 
the oldest version of a file is kept on 
-disk for at least 5 hours, thereby preventing the longest running query from 
failing at any point in time. Incremental cleaning is also possible using this 
policy.
+disk for at least 5 hours, thereby preventing the longest running query from 
failing at any point in time. Incremental cleaning is also possible using this 
policy. 
+Number of commits to retain can be configured by 
`hoodie.cleaner.commits.retained`.
+
 - **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N 
number of file versions irrespective of time. 
 This policy is useful when it is known how many MAX versions of the file does 
one want to keep at any given time. 
 To achieve the same behaviour as before of preventing long running queries 
from failing, one should do their calculations 
 based on data patterns. Alternatively, this policy is also useful if a user 
just wants to maintain 1 latest version of the file.
+Number of file versions to retain can be configured by 
`hoodie.cleaner.fileversions.retained`.
+
 - **KEEP_LATEST_BY_HOURS**: This policy clean up based on hours.It is simple 
and useful when knowing that you want to keep files at any given time.
   Corresponding to commits with commit times older than the configured number 
of hours to be retained are cleaned.
-  Currently you can configure by parameter 'hoodie.cleaner.hours.retained'.
+  Currently you can configure by parameter `hoodie.cleaner.hours.retained`.
 
 ### Configurations
 For details about all possible configurations and their default values see the 
[configuration 
docs](https://hudi.apache.org/docs/configurations#Compaction-Configs).
@@ -35,12 +39,52 @@ Hoodie Cleaner can be run as a separate process or along 
with your data ingestio
 ingesting data, configs are available which enable you to run it 
[synchronously or 
asynchronously](https://hudi.apache.org/docs/configurations#hoodiecleanasync).
 
 You can use this command for running the cleaner independently:
-```java
-[hoodie]$ spark-submit --class org.apache.hudi.utilities.HoodieCleaner \
-  --props s3:///temp/hudi-ingestion-config/kafka-source.properties \
-  --target-base-path s3:///temp/hudi \
-  --spark-master yarn-cluster
 ```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner 
`ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
+        Usage: <main class> [options]
+        Options:
+        --help, -h
+
+        --hoodie-conf
+        Any configuration that can be set in the properties file (using the CLI
+        parameter "--props") can also be passed command line using this
+        parameter. This can be repeated
+        Default: []
+        --props
+        path to properties file on localfs or dfs, with configurations for
+        hoodie client for cleaning
+        --spark-master
+        spark master to use.
+        Default: local[2]
+        * --target-base-path
base path for the hoodie table to be cleaned.
+```
+Some examples of running the cleaner:
+Keep the latest 10 commits
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner 
`ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar`\
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
+  --hoodie-conf hoodie.cleaner.commits.retained=10 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Keep the latest 3 file versions
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner 
`ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar`\
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS \
+  --hoodie-conf hoodie.cleaner.fileversions.retained=3 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Clean commits older than 24 hours
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner 
`ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar`\
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS \
+  --hoodie-conf hoodie.cleaner.hours.retained=24 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Note: The parallelism used by the cleaner is the minimum of the number of partitions to clean and `hoodie.cleaner.parallelism`.
 
 ### Run Asynchronously
 In case you wish to run the cleaner service asynchronously with writing, 
please configure the below:
@@ -57,4 +101,9 @@ CLI provides the below commands for cleaner service:
 - `clean showpartitions`
 - `cleans run`
 
+Example of the cleaner keeping the latest 3 commits
+```
+cleans run --sparkMaster local --hoodieConfigs 
hoodie.cleaner.policy=KEEP_LATEST_COMMITS,hoodie.cleaner.commits.retained=3,hoodie.cleaner.parallelism=200
+```
+
 You can find more details and the relevant code for these commands in 
[`org.apache.hudi.cli.commands.CleansCommand`](https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CleansCommand.java)
 class. 
diff --git a/website/versioned_docs/version-0.12.0/hoodie_cleaner.md 
b/website/versioned_docs/version-0.12.0/hoodie_cleaner.md
index 10f1aa2450..d4269d4904 100644
--- a/website/versioned_docs/version-0.12.0/hoodie_cleaner.md
+++ b/website/versioned_docs/version-0.12.0/hoodie_cleaner.md
@@ -19,10 +19,17 @@ having lookback into all the changes that happened in the 
last X commits. Suppos
 into a Hudi dataset every 30 minutes and the longest running query can take 5 
hours to finish, then the user should 
 retain atleast the last 10 commits. With such a configuration, we ensure that 
the oldest version of a file is kept on 
 disk for at least 5 hours, thereby preventing the longest running query from 
failing at any point in time. Incremental cleaning is also possible using this 
policy.
+  Number of commits to retain can be configured by 
`hoodie.cleaner.commits.retained`.
+
 - **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N 
number of file versions irrespective of time. 
 This policy is useful when it is known how many MAX versions of the file does 
one want to keep at any given time. 
 To achieve the same behaviour as before of preventing long running queries 
from failing, one should do their calculations 
 based on data patterns. Alternatively, this policy is also useful if a user 
just wants to maintain 1 latest version of the file.
+Number of file versions to retain can be configured by 
`hoodie.cleaner.fileversions.retained`.
+
+- **KEEP_LATEST_BY_HOURS**: This policy cleans up based on hours. It is simple and useful when you want to retain files for a fixed window of time.
+  File versions corresponding to commits with commit times older than the configured number of hours are cleaned.
+  The number of hours to retain can be configured by `hoodie.cleaner.hours.retained`.
 
 ### Configurations
 For details about all possible configurations and their default values see the 
[configuration 
docs](https://hudi.apache.org/docs/configurations#Compaction-Configs).
@@ -32,12 +39,52 @@ Hoodie Cleaner can be run as a separate process or along 
with your data ingestio
 ingesting data, configs are available which enable you to run it 
[synchronously or 
asynchronously](https://hudi.apache.org/docs/configurations#hoodiecleanasync).
 
 You can use this command for running the cleaner independently:
-```java
-[hoodie]$ spark-submit --class org.apache.hudi.utilities.HoodieCleaner \
-  --props s3:///temp/hudi-ingestion-config/kafka-source.properties \
-  --target-base-path s3:///temp/hudi \
-  --spark-master yarn-cluster
 ```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner 
`ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
+        Usage: <main class> [options]
+        Options:
+        --help, -h
+
+        --hoodie-conf
+        Any configuration that can be set in the properties file (using the CLI
+        parameter "--props") can also be passed command line using this
+        parameter. This can be repeated
+        Default: []
+        --props
+        path to properties file on localfs or dfs, with configurations for
+        hoodie client for cleaning
+        --spark-master
+        spark master to use.
+        Default: local[2]
+        * --target-base-path
base path for the hoodie table to be cleaned.
+```
+Some examples of running the cleaner:
+Keep the latest 10 commits
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner 
`ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar`\
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
+  --hoodie-conf hoodie.cleaner.commits.retained=10 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Keep the latest 3 file versions
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner 
`ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar`\
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS \
+  --hoodie-conf hoodie.cleaner.fileversions.retained=3 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Clean commits older than 24 hours
+```
+spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner 
`ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar`\
+  --target-base-path /path/to/hoodie_table \
+  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS \
+  --hoodie-conf hoodie.cleaner.hours.retained=24 \
+  --hoodie-conf hoodie.cleaner.parallelism=200
+```
+Note: The parallelism used by the cleaner is the minimum of the number of partitions to clean and `hoodie.cleaner.parallelism`.
 
 ### Run Asynchronously
 In case you wish to run the cleaner service asynchronously with writing, 
please configure the below:
@@ -54,4 +101,9 @@ CLI provides the below commands for cleaner service:
 - `clean showpartitions`
 - `cleans run`
 
+Example of the cleaner keeping the latest 3 commits
+```
+cleans run --sparkMaster local --hoodieConfigs 
hoodie.cleaner.policy=KEEP_LATEST_COMMITS,hoodie.cleaner.commits.retained=3,hoodie.cleaner.parallelism=200
+```
+
 You can find more details and the relevant code for these commands in 
[`org.apache.hudi.cli.commands.CleansCommand`](https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CleansCommand.java)
 class. 
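As a side note on the retention guidance in the KEEP_LATEST_COMMITS text added above (a writer committing every 30 minutes and a 5-hour longest-running query implies retaining at least the last 10 commits): the figure is just a ceiling division. A minimal sketch in Python; the function name and inputs are illustrative, not part of Hudi:

```python
import math

def min_commits_to_retain(commit_interval_minutes, longest_query_minutes):
    """Smallest hoodie.cleaner.commits.retained value that keeps every
    file version on disk for the duration of the longest running query."""
    return math.ceil(longest_query_minutes / commit_interval_minutes)

# The doc's example: 30-minute commits, 5-hour query -> retain at least 10 commits.
print(min_commits_to_retain(30, 5 * 60))  # 10
```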

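The note in the patch that the parallelism "takes the min value of number of partitions to clean and `hoodie.cleaner.parallelism`" is a simple cap; a sketch under that description, with the function name and partition counts as assumptions for illustration:

```python
def effective_cleaner_parallelism(num_partitions_to_clean, configured_parallelism=200):
    """Effective Spark parallelism for a clean action: bounded by both the
    number of partitions that actually need cleaning and the configured value."""
    return min(num_partitions_to_clean, configured_parallelism)

print(effective_cleaner_parallelism(35))    # 35: fewer partitions than configured
print(effective_cleaner_parallelism(1000))  # 200: capped by hoodie.cleaner.parallelism
```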
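For the KEEP_LATEST_BY_HOURS policy described in the patch, deciding whether a commit's older file versions are eligible for cleaning reduces to a timestamp comparison against the retention window; a sketch, with the helper name and timestamps being illustrative assumptions rather than Hudi internals:

```python
from datetime import datetime, timedelta

def is_commit_cleanable(commit_time, hours_retained=24, now=None):
    """True if a commit is older than the retention window, so file versions
    behind it are eligible for cleaning under KEEP_LATEST_BY_HOURS."""
    now = now or datetime.now()
    return commit_time < now - timedelta(hours=hours_retained)

now = datetime(2022, 8, 30, 12, 0)
print(is_commit_cleanable(datetime(2022, 8, 29, 10, 0), 24, now))  # True: 26h old
print(is_commit_cleanable(datetime(2022, 8, 30, 9, 0), 24, now))   # False: 3h old
```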