pratyakshsharma commented on code in PR #9699: URL: https://github.com/apache/hudi/pull/9699#discussion_r1325451523
##########
website/docs/procedures.md:
##########
@@ -1458,6 +1463,47 @@ call show_compaction(table => 'test_hudi_table', limit => 1);
|-------------------|------------|---------|
| 20220408153707928 | compaction | 10 |
+### run_clean
+
+Run cleaner on a hoodie table.
+
+**Input**
+
+| Parameter Name | Type | Required | Default Value | Description |
+|----------------|---------|----------|---------------|-------------|
+| table | String | Y | None | Name of table to be cleaned |
+| schedule_in_line | Boolean | N | true | Set "true" if you want to schedule and run a clean. Set false if you have already scheduled a clean and want to run that. |
+| [clean_policy](/docs/next/configurations#hoodiecleanerpolicy) | String | N | None | org.apache.hudi.common.model.HoodieCleaningPolicy: Cleaning policy to be used. The cleaner service deletes older file slices files to re-claim space. Long running query plans may often refer to older file slices and will break if those are cleaned, before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used). KEEP_LATEST_FILE_VERSIONS: keeps the last N versions of the file slices written; used when "hoodie.cleaner.fileversions.retained" is explicitly set only. KEEP_LATEST_COMMITS(default): keeps the file slices written by the last N commits; used when "hoodie.cleaner.commits.retained" is explicitly set only. KEEP_LATEST_BY_HOURS: keeps the file slices written in the last N hours based on the commit time; used when "hoodie.cleaner.hours.retained" is explicitly set only. |
+| [retain_commits](/docs/next/configurations#hoodiecleanercommitsretained) | Int | N | None | Number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries. |
Review Comment:
I think it would be good to mention here that this property is used when
KEEP_LATEST_COMMITS is set as the cleaner policy.
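A short example in the docs could make the pairing explicit. The call below is only a sketch: it assumes the parameter names from the table in this hunk, and the table name and the value 10 are placeholders rather than recommended settings.

```sql
-- Sketch: retain_commits is meaningful when the policy is KEEP_LATEST_COMMITS.
-- 'test_hudi_table' and the value 10 are illustrative placeholders.
call run_clean(
  table => 'test_hudi_table',
  clean_policy => 'KEEP_LATEST_COMMITS',
  retain_commits => 10
);
```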
##########
website/docs/procedures.md:
##########
@@ -1458,6 +1463,47 @@ call show_compaction(table => 'test_hudi_table', limit => 1);
|-------------------|------------|---------|
| 20220408153707928 | compaction | 10 |
+### run_clean
+
+Run cleaner on a hoodie table.
+
+**Input**
+
+| Parameter Name | Type | Required | Default Value | Description |
+|----------------|---------|----------|---------------|-------------|
+| table | String | Y | None | Name of table to be cleaned |
+| schedule_in_line | Boolean | N | true | Set "true" if you want to schedule and run a clean. Set false if you have already scheduled a clean and want to run that. |
+| [clean_policy](/docs/next/configurations#hoodiecleanerpolicy) | String | N | None | org.apache.hudi.common.model.HoodieCleaningPolicy: Cleaning policy to be used. The cleaner service deletes older file slices files to re-claim space. Long running query plans may often refer to older file slices and will break if those are cleaned, before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used). KEEP_LATEST_FILE_VERSIONS: keeps the last N versions of the file slices written; used when "hoodie.cleaner.fileversions.retained" is explicitly set only. KEEP_LATEST_COMMITS(default): keeps the file slices written by the last N commits; used when "hoodie.cleaner.commits.retained" is explicitly set only. KEEP_LATEST_BY_HOURS: keeps the file slices written in the last N hours based on the commit time; used when "hoodie.cleaner.hours.retained" is explicitly set only. |
+| [retain_commits](/docs/next/configurations#hoodiecleanercommitsretained) | Int | N | None | Number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries. |
+| [hours_retained](/docs/next/configurations#hoodiecleanerhoursretained) | Int | N | None | Number of hours for which commits need to be retained. This config provides a more flexible option ascompared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned. |
Review Comment:
nit: ascompared -> as compared
##########
website/docs/procedures.md:
##########
@@ -1458,6 +1463,47 @@ call show_compaction(table => 'test_hudi_table', limit => 1);
|-------------------|------------|---------|
| 20220408153707928 | compaction | 10 |
+### run_clean
+
+Run cleaner on a hoodie table.
+
+**Input**
+
+| Parameter Name | Type | Required | Default Value | Description |
+|----------------|---------|----------|---------------|-------------|
+| table | String | Y | None | Name of table to be cleaned |
+| schedule_in_line | Boolean | N | true | Set "true" if you want to schedule and run a clean. Set false if you have already scheduled a clean and want to run that. |
+| [clean_policy](/docs/next/configurations#hoodiecleanerpolicy) | String | N | None | org.apache.hudi.common.model.HoodieCleaningPolicy: Cleaning policy to be used. The cleaner service deletes older file slices files to re-claim space. Long running query plans may often refer to older file slices and will break if those are cleaned, before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used). KEEP_LATEST_FILE_VERSIONS: keeps the last N versions of the file slices written; used when "hoodie.cleaner.fileversions.retained" is explicitly set only. KEEP_LATEST_COMMITS(default): keeps the file slices written by the last N commits; used when "hoodie.cleaner.commits.retained" is explicitly set only. KEEP_LATEST_BY_HOURS: keeps the file slices written in the last N hours based on the commit time; used when "hoodie.cleaner.hours.retained" is explicitly set only. |
+| [retain_commits](/docs/next/configurations#hoodiecleanercommitsretained) | Int | N | None | Number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries. |
+| [hours_retained](/docs/next/configurations#hoodiecleanerhoursretained) | Int | N | None | Number of hours for which commits need to be retained. This config provides a more flexible option ascompared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned. |
Review Comment:
Again, let's mention that this is to be used with the KEEP_LATEST_BY_HOURS policy.
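As with retain_commits, a brief example might help readers see the pairing. This is a sketch under the same assumptions: the parameter names are taken from the table in this hunk, while the table name and the 24-hour window are placeholders.

```sql
-- Sketch: hours_retained is meaningful when the policy is KEEP_LATEST_BY_HOURS.
-- 'test_hudi_table' and the value 24 are illustrative placeholders.
call run_clean(
  table => 'test_hudi_table',
  clean_policy => 'KEEP_LATEST_BY_HOURS',
  hours_retained => 24
);
```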
