steveloughran commented on PR #7425: URL: https://github.com/apache/hadoop/pull/7425#issuecomment-2678822267
I don't want to go anywhere near that code as it is (a) critical and (b) an incredibly complicated co-recursive mix of two algorithms where you have to step through with a debugger to work out WTF is going wrong. It isn't suited to cloud storage, and even with HDFS it hits limits due to lack of parallelisation.

So sorry, no, I don't want to touch this. There's just too much risk.

At the same time, if we can speed up the manifest committer, there's appeal there.

Glancing at the RenameFilesStage, it already remembers if a dir had to be created, and if so knows there's nothing at the far end. Otherwise it does that probe + delete.

An optimistic commit there may have benefits, especially with Azure, where the HEAD probe will double the IO load of any rename, and job commit can put a lot of strain on IO quotas.

Can you take a look there?

I'm going to recommend

* start with the base manifest committer and your normal workload
* set `mapreduce.manifest.committer.summary.report.directory` to a directory. This will save the iostatistics summary of the job for viewing, including summaries of the number and duration of delete calls.
* see if you could make RenameFilesStage.commitOneFile more optimistic. Ultimately this'd have to be made optional, but for an experiment it'd be good to see what gains you get

Test on HDFS: the manifest committer works well there *and is more performant than the older committer, due to the parallel renames*. A before/after test on abfs would be interesting too.

ABFS is a special pain point here as it does have problems with rename under load; if that load can be reduced, then that's good. But if the parent dirs are actually created by the job, such as when committing into an empty directory tree, I wouldn't expect any change at all, because the probe is already skipped in that case.
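To make the probe + delete versus optimistic trade-off concrete, here is a minimal sketch against a toy in-memory store. It is not the RenameFilesStage code and does not use the Hadoop FileSystem API: `CountingStore`, `commitWithProbe` and `commitOptimistic` are hypothetical names, and the store assumes that a rename onto an existing destination fails by returning false, which real filesystems and object stores may not do in the same way.

```java
import java.util.HashMap;
import java.util.Map;

public class OptimisticRenameSketch {

    /** Toy in-memory store that counts calls. Hypothetical; NOT a Hadoop FileSystem. */
    static class CountingStore {
        final Map<String, String> files = new HashMap<>();
        int probes;
        int deletes;
        int renames;

        boolean exists(String path) {
            probes++;
            return files.containsKey(path);
        }

        boolean delete(String path) {
            deletes++;
            return files.remove(path) != null;
        }

        /** Assumption: rename returns false if src is missing or dest already exists. */
        boolean rename(String src, String dest) {
            renames++;
            if (!files.containsKey(src) || files.containsKey(dest)) {
                return false;
            }
            files.put(dest, files.remove(src));
            return true;
        }

        int totalCalls() {
            return probes + deletes + renames;
        }
    }

    /** Probe first: check the destination, delete it if present, then rename. */
    static boolean commitWithProbe(CountingStore store, String src, String dest) {
        if (store.exists(dest)) {
            store.delete(dest);
        }
        return store.rename(src, dest);
    }

    /** Optimistic: rename first; pay for the probe and delete only if that fails. */
    static boolean commitOptimistic(CountingStore store, String src, String dest) {
        if (store.rename(src, dest)) {
            return true;
        }
        // The rename may have failed for another reason (e.g. missing source),
        // so only delete and retry when the destination really exists.
        if (!store.exists(dest)) {
            return false;
        }
        store.delete(dest);
        return store.rename(src, dest);
    }

    public static void main(String[] args) {
        CountingStore probeFirst = new CountingStore();
        probeFirst.files.put("tmp/part-0000", "data");
        commitWithProbe(probeFirst, "tmp/part-0000", "out/part-0000");

        CountingStore optimistic = new CountingStore();
        optimistic.files.put("tmp/part-0000", "data");
        commitOptimistic(optimistic, "tmp/part-0000", "out/part-0000");

        System.out.println("probe-first calls: " + probeFirst.totalCalls());
        System.out.println("optimistic calls:  " + optimistic.totalCalls());
    }
}
```

In this toy model the optimistic path issues one call instead of two when the destination is empty, but four instead of three when a file is already there, so any gain depends on how often the workload overwrites existing files. That is one reason to measure with the summary report before and after rather than assume a win.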
