geserdugarov commented on code in PR #10856:
URL: https://github.com/apache/hudi/pull/10856#discussion_r1522896072


##########
website/docs/basic_configurations.md:
##########
@@ -101,15 +102,14 @@ Flink jobs using the SQL can be configured through the options in WITH clause. T
 | [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name to register to Hive metastore<br /> `Config Param: DATABASE_NAME` |
 | [hoodie.table.name](#hoodietablename) | (N/A) | Table name to register to Hive metastore<br /> `Config Param: TABLE_NAME` |
 | [path](#path) | (N/A) | Base path for the target hoodie table. The path would be created if it does not exist, otherwise a Hoodie table expects to be initialized successfully<br /> `Config Param: PATH` |
+| [read.commits.limit](#readcommitslimit) | (N/A) | The maximum number of commits allowed to read in each instant check, if it is streaming read, the avg read instants number per-second would be 'read.commits.limit'/'read.streaming.check-interval', by default no limit<br /> `Config Param: READ_COMMITS_LIMIT` |
 | [read.end-commit](#readend-commit) | (N/A) | End commit instant for reading, the commit time format should be 'yyyyMMddHHmmss'<br /> `Config Param: READ_END_COMMIT` |
 | [read.start-commit](#readstart-commit) | (N/A) | Start commit instant for reading, the commit time format should be 'yyyyMMddHHmmss', by default reading from the latest instant for streaming read<br /> `Config Param: READ_START_COMMIT` |
 | [archive.max_commits](#archivemax_commits) | 50 | Max number of commits to keep before archiving older commits into a sequential log, default 50<br /> `Config Param: ARCHIVE_MAX_COMMITS` |
 | [archive.min_commits](#archivemin_commits) | 40 | Min number of commits to keep before archiving older commits into a sequential log, default 40<br /> `Config Param: ARCHIVE_MIN_COMMITS` |
 | [cdc.enabled](#cdcenabled) | false | When enable, persist the change data if necessary, and can be queried as a CDC query mode<br /> `Config Param: CDC_ENABLED` |
 | [cdc.supplemental.logging.mode](#cdcsupplementalloggingmode) | DATA_BEFORE_AFTER | Setting 'op_key_only' persists the 'op' and the record key only, setting 'data_before' persists the additional 'before' image, and setting 'data_before_after' persists the additional 'before' and 'after' images.<br /> `Config Param: SUPPLEMENTAL_LOGGING_MODE` |
 | [changelog.enabled](#changelogenabled) | false | Whether to keep all the intermediate changes, we try to keep all the changes of a record when enabled: 1). The sink accept the UPDATE_BEFORE message; 2). The source try to emit every changes of a record. The semantics is best effort because the compaction job would finally merge all changes of a record into one. default false to have UPSERT semantics<br /> `Config Param: CHANGELOG_ENABLED` |
-| [clean.async.enabled](#cleanasyncenabled) | true | Whether to cleanup the old commits immediately on new commits, enabled by default<br /> `Config Param: CLEAN_ASYNC_ENABLED` |
-| [clean.retain_commits](#cleanretain_commits) | 30 | Number of commits to retain. So data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much you can incrementally pull on this table, default 30<br /> `Config Param: CLEAN_RETAIN_COMMITS` |

Review Comment:
   Yes, both files `website/docs/basic_configurations.md` and 
`website/docs/configurations.md` were generated using 
[`hudi-utils/generate_config.sh`](https://github.com/apache/hudi/blob/asf-site/hudi-utils/generate_config.sh)
 in the `asf-site` branch.
   The process is described in 
[`hudi-utils/README.md`](https://github.com/apache/hudi/blob/asf-site/hudi-utils/README.md).
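   For reference, the regeneration amounts to roughly the following (a sketch only, assuming a local clone of `apache/hudi` with the `asf-site` branch checked out; `hudi-utils/README.md` remains the authoritative guide):

   ```shell
   # Sketch -- see hudi-utils/README.md for the full, authoritative steps.
   git checkout asf-site      # the docs tooling lives on the asf-site branch
   cd hudi-utils
   ./generate_config.sh       # regenerates the configuration reference pages
   ```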
   
   Specifically, these two removed lines correspond to the open [PR 10851](https://github.com/apache/hudi/pull/10851) and should be merged only if that code-change PR is merged.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
