parisni commented on issue #6373:
URL: https://github.com/apache/hudi/issues/6373#issuecomment-1234243547

   > I guess there is a problem to use incremental cleaning together with 
KEEP_LATEST_COMMITS which lead to never clean some partitions after a first 
clean
   
   See https://github.com/apache/hudi/pull/6498
   
   
   > you can leverage hoodie.clean.max.commits to reduce the frequency w/ which 
cleaner runs
   
   Here is the problem: so far, with OCC, the cleaner is the only component 
that rolls back dead inflight commits. So if I run the cleaner less frequently, 
those dead commits block the MDT compaction, which increases the number of 
files to merge during MOR and degrades MDT performance, including cleaning. In 
my case I often get such dead commits, since our platform often SIGKILLs Spark 
jobs. If I never run cleaning, those commits never go away, and neither do the 
HFiles to merge. 
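   
   For reference, a minimal sketch of the writer options involved in this 
trade-off. The keys are standard Hudi configs; the values, and the helper 
`cleaner_runs`, are illustrative assumptions and not taken from the actual job:

```python
# Illustrative Hudi writer options for the scenario discussed above.
# All values are assumptions made for the sake of the example.
hudi_options = {
    # Multi-writer setup using optimistic concurrency control (OCC).
    # A lock provider is also required for OCC; it is omitted here.
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    # With OCC, failed writes are rolled back lazily, by the cleaner.
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "10",
    # Attempt a new clean only after this many commits since the last one.
    "hoodie.clean.max.commits": "5",
    "hoodie.metadata.enable": "true",
}


def cleaner_runs(total_commits: int, options: dict) -> int:
    """Rough number of clean attempts over `total_commits` commits,
    assuming one attempt every `hoodie.clean.max.commits` commits."""
    every = int(options["hoodie.clean.max.commits"])
    return total_commits // every
```

   Raising `hoodie.clean.max.commits` lowers the number of cleaner runs, but 
with the lazy policy it also delays the rollback of dead inflight commits.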
   
   > I have put up a fix to leverage metadata table incase of 
LATEST_FILE_VERSIONS.
   
   If this fix makes cleaning with the MDT faster, then yes, that would solve 
my issue. 
   But I don't think it will. The reason is that the method you replaced is 
not the bottleneck. The time is spent asking the MDT for the files to delete, 
in this method: 
https://github.com/apache/hudi/blob/6e7ac457352e007939ba3c44c9dc197de7b88ed3/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L439
   It is called here: 
   
https://github.com/apache/hudi/blob/52e63b39d6189beb3b381944ed553bb0052b12c9/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java#L113
   The cleaner first lists the partitions and then, for each partition, lists 
the parquet files. With 100k partitions, those lookups are slow compared to 
listing the file system directly. The more files the MDT has to merge, the 
slower the cleaning is! 
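   
   The listing pattern described above can be modelled with a small toy 
sketch. This is not Hudi code: `CountingLister`, `plan_clean` and the file 
names are hypothetical stand-ins, used only to show how the number of lookups 
grows with the number of partitions:

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a table: partition path -> data file names.
Table = Dict[str, List[str]]


class CountingLister:
    """Toy listing backend that counts how many lookups it serves."""

    def __init__(self, table: Table) -> None:
        self._table = table
        self.lookups = 0

    def list_partitions(self) -> List[str]:
        self.lookups += 1
        return sorted(self._table)

    def list_files(self, partition: str) -> List[str]:
        self.lookups += 1
        return list(self._table[partition])


def plan_clean(
    lister: CountingLister, should_delete: Callable[[str], bool]
) -> Dict[str, List[str]]:
    """List partitions, then list files per partition, keep deletable ones."""
    plan: Dict[str, List[str]] = {}
    for partition in lister.list_partitions():
        doomed = [f for f in lister.list_files(partition) if should_delete(f)]
        if doomed:
            plan[partition] = doomed
    return plan


table = {f"p={i}": ["old.parquet", "new.parquet"] for i in range(1000)}
lister = CountingLister(table)
plan = plan_clean(lister, lambda name: name.startswith("old"))
print(lister.lookups)  # 1001: one partition listing + 1000 file listings
```

   With 100k partitions the same loop issues 100,001 lookups, so the cost of a 
single lookup in the chosen backend dominates the planning time. Passing the 
lister in as a parameter is one way to picture a cleaner whose listing backend 
can be chosen independently of the rest of the table.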
   
   Again, on the same table: clean w/ MDT (with 4 log files): 4h
   Clean w/o MDT: 5min.
   So a full file listing is quite fast.
   
   It would be helpful if we could configure the cleaning method. In my case 
(100k partitions) I'd like to keep the MDT enabled but use the plain file 
system for cleaning. 
   
   On August 30, 2022 11:41:19 PM UTC, Sivabalan Narayanan wrote:
   >and wrt your statement `Also I guess there is a problem to use incremental 
cleaning together with KEEP_LATEST_COMMITS which lead to never clean some 
partitions after a first clean but I will open a separate issue for this one.`, 
if you happen to create a new issue, let me know. do tag me in there. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
