[hudi] branch asf-site updated: [HUDI-2805] - Docs for HoodieCleaner (#4052)

bhavanisudha Wed, 24 Nov 2021 14:25:29 -0800

This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 570b5f5  [HUDI-2805] - Docs for HoodieCleaner (#4052)
570b5f5 is described below

commit 570b5f580050cc17fdad52b9ce4e13dd141d9557
Author: Kyle Weller <[email protected]>
AuthorDate: Wed Nov 24 15:25:12 2021 -0700

    [HUDI-2805] - Docs for HoodieCleaner (#4052)
    
    * added docs for hoodie cleaner
    
    * started cleaner docs
    
    * merged changes with upstream conflict
---
 website/docs/hoodie_cleaner.md | 57 ++++++++++++++++++++++++++++++++++++++++++
 website/sidebars.js            |  1 +
 2 files changed, 58 insertions(+)

diff --git a/website/docs/hoodie_cleaner.md b/website/docs/hoodie_cleaner.md
new file mode 100644
index 0000000..41956f5
--- /dev/null
+++ b/website/docs/hoodie_cleaner.md
@@ -0,0 +1,57 @@
+---
+title: Cleaning
+toc: true
+---
+
+Hoodie Cleaner is a utility that helps you reclaim space and keep your storage costs in check. Apache Hudi provides
+snapshot isolation between writers and readers by keeping multiple versions of each file under MVCC. These file versions
+provide history and enable time travel and rollbacks, but it is important to manage how much history you keep to balance your costs.
+
+[Automatic Hudi cleaning](/docs/configurations/#hoodiecleanautomatic) is enabled by default. Cleaning is invoked immediately after
+each commit to delete older file slices. We recommend leaving it enabled so that metadata and data storage growth stays bounded.
+
+### Cleaning Retention Policies
+When cleaning old files, be careful not to remove files that are still being used by long-running queries.
+Hudi cleaner currently supports the following cleaning policies to keep a certain number of commits or file versions:
+
+- **KEEP_LATEST_COMMITS**: This is the default policy. It is a temporal cleaning policy that lets you look back at
+all the changes that happened in the last X commits. Suppose a writer ingests data
+into a Hudi dataset every 30 minutes and the longest-running query can take 5 hours to finish. In that case, the user should
+retain at least the last 10 commits (5 hours / 30 minutes = 10). With such a configuration, the oldest version of a file is kept on
+disk for at least 5 hours, which prevents the longest-running query from failing at any point in time. Incremental cleaning is also possible using this policy.
+- **KEEP_LATEST_FILE_VERSIONS**: This policy keeps the latest N versions of each file, irrespective of time.
+It is useful when you know the maximum number of versions of a file you want to keep at any given time.
+To get the same protection for long-running queries as with KEEP_LATEST_COMMITS, choose N
+based on your data patterns. This policy is also useful if you only want to maintain the single latest version of each file.
+
+### Configurations
+For details about all possible configurations and their default values, see the [configuration docs](https://hudi.apache.org/docs/configurations#Compaction-Configs).
+
+### Run Independently
+Hoodie Cleaner can be run as a separate process or along with your data ingestion. If you want to run it along with
+data ingestion, configs are available that let you run it [synchronously or asynchronously](https://hudi.apache.org/docs/configurations#hoodiecleanasync).
+
+You can use this command for running the cleaner independently:
+```shell
+[hoodie]$ spark-submit --class org.apache.hudi.utilities.HoodieCleaner \
+  --props s3:///temp/hudi-ingestion-config/kafka-source.properties \
+  --target-base-path s3:///temp/hudi \
+  --spark-master yarn-cluster
+```
+
+### Run Asynchronously
+To run the cleaner service asynchronously alongside writes, set the following:
+```properties
+hoodie.clean.automatic=true
+hoodie.clean.async=true
+```
+
+### CLI
+You can also use the [Hudi CLI](https://hudi.apache.org/docs/deployment#cli) to run Hoodie Cleaner.
+
+The CLI provides the following commands for the cleaner service:
+- `cleans show`
+- `clean showpartitions`
+- `cleans run`
+
+You can find more details and the relevant code for these commands in the [`org.apache.hudi.cli.commands.CleansCommand`](https://github.com/apache/hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CleansCommand.java) class.
diff --git a/website/sidebars.js b/website/sidebars.js
index 42029b6..26a6602 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -46,6 +46,7 @@ module.exports = {
                 'migration_guide',
                 'compaction',
                 'clustering',
+                'hoodie_cleaner',
                 'snapshot_exporter'
             ],
         },
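
A note on the KEEP_LATEST_COMMITS example in the new page: the retention count it arrives at (10 commits for a 30-minute ingest interval and a 5-hour query) is the longest query duration divided by the commit interval, rounded up. The sketch below reproduces that arithmetic. The helper name `commits_to_retain` is hypothetical, and the two property keys it prints (`hoodie.cleaner.policy`, `hoodie.cleaner.commits.retained`) do not appear in the patch; treat them as assumptions and check them against the configurations page before use.

```python
import math


def commits_to_retain(ingest_interval_minutes: float, longest_query_minutes: float) -> int:
    """Return the smallest commit count whose span covers the longest query.

    Hypothetical helper for illustration only; it is not part of Hudi.
    """
    if ingest_interval_minutes <= 0:
        raise ValueError("ingest interval must be positive")
    # Round up so that a query slightly longer than a whole number of
    # intervals still has its file versions retained. Keep at least 1 commit.
    return max(1, math.ceil(longest_query_minutes / ingest_interval_minutes))


if __name__ == "__main__":
    # Scenario from the docs: commits every 30 minutes, longest query 5 hours.
    retained = commits_to_retain(30, 5 * 60)
    # Property keys below are assumptions; verify them in the Hudi config docs.
    print("hoodie.cleaner.policy=KEEP_LATEST_COMMITS")
    print(f"hoodie.cleaner.commits.retained={retained}")
```

Rounding up rather than down is the conservative choice here: retaining one extra commit costs some storage, while retaining one too few can make a long-running query fail.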
