steveloughran commented on PR #7425:
URL: https://github.com/apache/hadoop/pull/7425#issuecomment-2678822267

   I don't want to go anywhere near that code as it is (a) critical and (b) an 
incredibly complicated co-recursive mix of two algorithms where you have to 
step through with a debugger to work out WTF is going wrong.
   
   It isn't suited to cloud storage and even with HDFS, it hits limits due to 
lack of parallelisation.
   
   So sorry, no, I don't want to touch this. There's just too much risk.
   
   At the same time, if we can speed up the manifest committer, there's appeal 
there. Glancing at the RenameFilesStage, it already remembers whether a dir had 
to be created, and if so knows there's nothing at the far end. Otherwise it 
does a probe + delete before the rename. 
   
   An optimistic commit there may have benefits, especially with Azure, where 
the HEAD probe will double the IO load of any rename, and job commit can put a 
lot of strain on IO quotas.
   
   Can you take a look there?
   
   I'm going to recommend
   * start with base manifest committer and your normal workload
   * set `mapreduce.manifest.committer.summary.report.directory` to a directory. 
This will save the iostatistics summary of the job for viewing, including 
summaries of the number and duration of delete calls.
   * see if you could make RenameFilesStage.commitOneFile more optimistic. 
Ultimately this'd have to be made optional, but for an experiment it'd be good 
to see what gains you get
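
   To make the idea concrete, here is a minimal sketch of the two strategies in 
plain Java, using `java.nio.file` on the local filesystem as a stand-in for a 
Hadoop FileSystem. It is not the RenameFilesStage code: the class, method and 
counter names are all hypothetical, and it assumes a rename fails when the 
destination already exists, which would need checking for each store.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration only; not the Hadoop manifest committer code. */
public class OptimisticRenameSketch {

    /** Existence probes issued; these are the calls the optimistic path avoids. */
    static int probes = 0;

    /** Times the optimistic path had to fall back to delete + second rename. */
    static int retries = 0;

    /**
     * Probe-first commit: unless the job created the parent dir itself
     * (so the destination cannot exist), probe and delete before renaming.
     */
    static void commitWithProbe(Path source, Path dest, boolean parentDirCreatedByJob)
            throws IOException {
        if (!parentDirCreatedByJob) {
            probes++;
            if (Files.exists(dest)) {
                Files.delete(dest);
            }
        }
        Files.move(source, dest);
    }

    /**
     * Optimistic commit: rename straight away and only pay for the delete
     * and a second rename when the destination turns out to exist.
     * Assumption: the rename fails on an existing destination.
     */
    static void commitOptimistically(Path source, Path dest) throws IOException {
        try {
            Files.move(source, dest);
        } catch (FileAlreadyExistsException e) {
            retries++;
            Files.delete(dest);
            Files.move(source, dest);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rename-sketch");
        Path dest = dir.resolve("part-0000");
        Files.write(dest, "old".getBytes(StandardCharsets.UTF_8));

        // Probe-first: one existence check, then delete + rename.
        Path first = Files.write(dir.resolve("attempt-1"),
                "new-1".getBytes(StandardCharsets.UTF_8));
        commitWithProbe(first, dest, false);

        // Optimistic: no probe; the collision is handled after the failed rename.
        Path second = Files.write(dir.resolve("attempt-2"),
                "new-2".getBytes(StandardCharsets.UTF_8));
        commitOptimistically(second, dest);

        System.out.println("probes=" + probes + " retries=" + retries + " content="
                + new String(Files.readAllBytes(dest), StandardCharsets.UTF_8));
    }
}
```

   The trade-off in this sketch is that the optimistic path costs one call when 
the destination is absent and three when it is present, against two or three 
for the probe-first path, so any gain depends on how often destinations 
already exist.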
   
   Test on HDFS: the manifest committer works well there *and is more 
performant than the older committer, due to the parallel renames*.
   
   A before/after test on abfs would be interesting too. ABFS is a special pain 
point here as it does have problems with rename under load; if that load can be 
reduced, then that's good. But if the parent dirs are actually created, such as 
when committing into an empty directory tree, I wouldn't expect any change at 
all.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

