n3nash edited a comment on issue #2841:
URL: https://github.com/apache/hudi/issues/2841#issuecomment-826556501
@AkshayChan Please find my answers inline
> 1. Is there a greatest value possible for the number of commits the hudi
cleaner can retain?
The num_commits value is eventually converted to an integer, so you can pass
Integer.MAX_VALUE. However, that will increase the number of files in the
.hoodie folder and eventually cause slowness due to file listing (especially on S3)
> 2. What about the greatest possible value for `hoodie.keep.min.commits`
and `hoodie.keep.max.commits`?
See above. Ideally, you want to set the value of `hoodie.keep.min.commits`
to a lower value than `hoodie.keep.max.commits`. I've explained the meaning of
all these configs below.
> 3. What is the difference between `hoodie.cleaner.commits.retained` and
`hoodie.keep.min.commits`? If I understand correctly, the _cleaner commits
retained_ is the maximum number of commits we can query incrementally/point in
time from, _keep min commits_ is the least number of commits stored (for
example in S3) and _keep max commits_ is the maximum number of commits stored.
There are 2 different concepts at play here.

**Hoodie Clean Process**

`hoodie.cleaner.commits.retained` is connected with the hoodie cleaner
process, which ensures that the files belonging to the last X commits are
retained. Say you set this value to 24: at most 24 versions of each file will
be retained at any time. This is a rolling window, so when a 25th version is
written, the oldest version is cleaned.
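To make the rolling window concrete, here is a minimal sketch (not Hudi code; the `clean` helper and the version labels are hypothetical) of retaining only the newest N versions of each file:

```python
# Hypothetical sketch of the cleaner's rolling-window retention.
# Assumes hoodie.cleaner.commits.retained = 3 for brevity (the text uses 24).

def clean(file_versions, commits_retained):
    """Keep only the newest `commits_retained` versions of each file."""
    return {f: versions[-commits_retained:]
            for f, versions in file_versions.items()}

# Four versions of one file exist; with 3 retained, the oldest is cleaned.
table = {"file_a": ["v1", "v2", "v3", "v4"]}
print(clean(table, 3))  # {'file_a': ['v2', 'v3', 'v4']}
```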
**Hoodie Commit File Archiving Process**

Hoodie creates metadata files under the .hoodie folder that track the kinds
of actions happening on the table. These files ensure the ACID properties that
provide isolation between readers and writers. The configs
`hoodie.keep.min.commits` and `hoodie.keep.max.commits` determine how many such
metadata files are kept under the .hoodie folder. Since a large number of files
under a folder can cause file-listing slowness, hoodie employs an archival
process that moves these metadata files into a LOG file to reduce the number
of files under .hoodie. The archival process works as follows: check whether
the number of commit files under the metadata folder has reached
`hoodie.keep.max.commits`; if yes, archive all commit files except the newest
`hoodie.keep.min.commits` files. This operation repeats whenever the condition
is met.
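The archival trigger described above can be sketched as follows (a simplified illustration, not Hudi's actual implementation; the `maybe_archive` helper and commit names are hypothetical):

```python
# Hypothetical sketch of the archival trigger, assuming
# hoodie.keep.min.commits = 3 and hoodie.keep.max.commits = 5.

def maybe_archive(commit_files, min_commits, max_commits):
    """If the number of commit files has reached `max_commits`, move all
    but the newest `min_commits` files into the archive."""
    archived = []
    if len(commit_files) >= max_commits:
        archived = commit_files[:-min_commits]   # oldest files go to archived.log*
        commit_files = commit_files[-min_commits:]
    return commit_files, archived

active, archived = maybe_archive(["c1", "c2", "c3", "c4", "c5"], 3, 5)
print(active)    # ['c3', 'c4', 'c5']
print(archived)  # ['c1', 'c2']
```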
There is a relationship between the cleaner and archival: you need to keep
`hoodie.cleaner.commits.retained` <= `hoodie.keep.min.commits`
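As a concrete illustration, a hypothetical set of values that satisfies this constraint (the numbers are examples only, not recommendations):

```
hoodie.cleaner.commits.retained=24
hoodie.keep.min.commits=30
hoodie.keep.max.commits=40
```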
> 4. Are any commits between the _keep min commits_ and _keep max commits_
thresholds archived as log files?
Answered above.
> 5. Are any commits that exceed the max commit threshold deleted?
All such commit files are archived into archived.log* files, not deleted.
> 6. Are there performance issues if I set these values very large? For
example:
>
> `hoodie.cleaner.commits.retained: 20000, hoodie.keep.min.commits: 20001,
hoodie.keep.max.commits: 30000`
>
> Thanks in advance
Yes, if you are using S3, file listing can slow down if you set these to very
large numbers (in the millions), and it can also cause OOM.
Let me know if you have any more questions.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]