n3nash edited a comment on issue #2841:
URL: https://github.com/apache/hudi/issues/2841#issuecomment-826556501


   @AkshayChan Please find my answers inline 
   
   > 1. Is there a greatest value possible for the number of commits the hudi 
cleaner can retain?
   
   The num_commits value is eventually converted to an integer, so you can pass 
Integer.MAX_VALUE. However, that will increase the number of files in the 
.hoodie folder and eventually cause slowness due to file listing (especially on S3).
   
   > 2. What about the greatest possible value for `hoodie.keep.min.commits` 
and `hoodie.keep.max.commits`?
   
   See above. Ideally, you want to set `hoodie.keep.min.commits` to a lower 
value than `hoodie.keep.max.commits`. I've explained the meaning of all these 
configs below. 
   
   > 3. What is the difference between `hoodie.cleaner.commits.retained` and 
`hoodie.keep.min.commits`? If I understand correctly, the _cleaner commits 
retained_ is the maximum number of commits we can query incrementally/point in 
time from, _keep min commits_ is the least number of commits stored (for 
example in S3) and _keep max commits_ is the maximum number of commits stored.
   
   There are 2 different concepts at play here. 
   
   Hoodie Clean Process
   
   The `hoodie.cleaner.commits.retained` config controls the hoodie cleaner 
process, which ensures that the file versions belonging to the last X commits 
are retained. Say you set this value to 24: at most 24 versions of each file 
will be retained at any time. This is a rolling window — when a 25th version is 
written, the first version is cleaned.
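   The rolling-window behavior can be sketched in a few lines of Python. This is a simplified illustration of the retention logic described above, not Hudi's actual implementation; the function name and list-based model are my own.

```python
# Hypothetical sketch of rolling-window cleaning: keep at most
# `commits_retained` versions of a file, clean anything older.
def clean_file_versions(versions, commits_retained):
    """versions is ordered oldest-first; returns (kept, cleaned)."""
    if len(versions) <= commits_retained:
        return versions, []
    cut = len(versions) - commits_retained
    return versions[cut:], versions[:cut]

# 25 versions of one file, hoodie.cleaner.commits.retained = 24:
kept, cleaned = clean_file_versions([f"v{i}" for i in range(1, 26)], 24)
# the oldest version (v1) is cleaned; v2..v25 are kept
```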
   
   Hoodie Commit File Archiving Process
   
   Hoodie creates metadata files under the .hoodie folder that track the kinds 
of actions happening on the table. These files provide the ACID properties that 
give isolation between readers and writers. The configs 
`hoodie.keep.min.commits` and `hoodie.keep.max.commits` determine how many such 
metadata files are kept under the .hoodie folder. Since a large number of files 
under a folder can slow down file listing, hoodie employs an archival process 
that moves these metadata files into a LOG file to reduce the number of files 
under .hoodie. The archival process works as follows: 
   
   Check if the number of commit files under .hoodie has reached 
`hoodie.keep.max.commits`; if yes, archive commit files until only 
`hoodie.keep.min.commits` files remain. Repeat this operation whenever the 
condition is met.
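   The archival trigger above can be sketched the same way. Again, this is a toy model of the described logic (my own function name and list-based state), not Hudi's code.

```python
# Hypothetical sketch of the archival trigger: once the commit-file count
# reaches keep_max, archive all but the newest keep_min files.
def maybe_archive(commit_files, keep_min, keep_max):
    """commit_files is ordered oldest-first; returns (active, archived)."""
    if len(commit_files) < keep_max:
        return commit_files, []
    cut = len(commit_files) - keep_min
    return commit_files[cut:], commit_files[:cut]

# 30 commit files with keep_min=20, keep_max=30:
active, archived = maybe_archive([f"c{i}" for i in range(1, 31)],
                                 keep_min=20, keep_max=30)
# the max is reached, so the 10 oldest files are archived; 20 stay active
```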
   
   There is a relationship between the cleaner and archival: you need to keep 
`hoodie.cleaner.commits.retained` <= `hoodie.keep.min.commits`.
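   A small validation helper (hypothetical, not part of Hudi) makes the constraint between the three configs explicit:

```python
# Hypothetical check enforcing the relationship described above:
# cleaner.commits.retained <= keep.min.commits < keep.max.commits
def validate_retention(cleaner_retained, keep_min, keep_max):
    if not (cleaner_retained <= keep_min < keep_max):
        raise ValueError(
            "require hoodie.cleaner.commits.retained <= "
            "hoodie.keep.min.commits < hoodie.keep.max.commits")
    return True

validate_retention(24, 25, 35)   # a valid combination
```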
   
   > 4. Are any commits between the _keep min commits_ and _keep max commits_ 
thresholds archived as log files?
   
   Answered above.
   
   > 5. Are any commits that exceed the max commit threshold deleted?
   
   All commit files are archived under archived.log* file names; they are not deleted. 
   
   > 6. Are there performance issues if I set these values very large? For 
example:
   > 
   > `hoodie.cleaner.commits.retained: 20000, hoodie.keep.min.commits: 20001, 
hoodie.keep.max.commits: 30000`
   > 
   > Thanks in advance
   
   Yes. If you are using S3, file listing can slow down if you set these to very 
large values (in the millions), and this can also cause OOMs.
   
   Let me know if you have any more questions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

