parisni commented on issue #6373: URL: https://github.com/apache/hudi/issues/6373#issuecomment-1234243547
> I guess there is a problem to use incremental cleaning together with KEEP_LATEST_COMMITS which lead to never clean some partitions after a first clean

See https://github.com/apache/hudi/pull/6498

> you can leverage hoodie.clean.max.commits to reduce the frequency w/ which cleaner runs

Here is the problem: so far, with OCC, the cleaner is the only component that rolls back dead inflight commits. If I run the cleaner less frequently, those dead commits block MDT compaction, which increases the number of files to merge (MOR) and degrades MDT performance, including cleaning. In my case I often get such dead commits, since our platform often SIGKILLs Spark jobs. If I never run cleaning, those commits never go away, and neither do the HFiles to merge.

> I have put up a fix to leverage metadata table incase of LATEST_FILE_VERSIONS.

If this fix makes cleaning with the MDT faster, then yes, that would solve my issue. But I don't think it will, because the method you replaced is not the bottleneck. The time is spent asking the MDT for the files to delete, in this method:
https://github.com/apache/hudi/blob/6e7ac457352e007939ba3c44c9dc197de7b88ed3/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L439

which is called here:
https://github.com/apache/hudi/blob/52e63b39d6189beb3b381944ed553bb0052b12c9/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java#L113

The cleaner first lists the partitions and then, for each partition, lists the parquet files. With 100k partitions, those lookups are slow compared to listing the file system directly. The more files the MDT has to merge, the slower the cleaning is!

Again, on the same table:
- clean w/ MDT with 4 log files: 4h
- clean w/o MDT: 5min

So full file listing is quite fast. It would be helpful if we could configure the cleaning method: in my case (100k partitions) I'd like to keep the MDT enabled but use the plain file system for cleaning.

On August 30, 2022 11:41:19 PM UTC, Sivabalan Narayanan wrote:
> and wrt your statement `Also I guess there is a problem to use incremental cleaning together with KEEP_LATEST_COMMITS which lead to never clean some partitions after a first clean but I will open a separate issue for this one.`, if you happen to create a new issue, let me know. do tag me in there.
