[
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505012#comment-13505012
]
Jason Lowe commented on MAPREDUCE-4815:
---------------------------------------
I think this will work well with a couple of caveats:
* Write permission on the parent directory of the output directory is a new
implicit requirement over the original FileOutputFormat. I think in the vast
majority of cases it won't be a problem, but it is a potential
backwards-compatibility issue.
* There are existing output formats that override checkOutputSpecs() and
explicitly remove the check that outputDir doesn't already exist (e.g.,
TeraOutputFormat). If we only support this new scheme, those output formats
could fail to commit since the rename in commitJob() will fail for a non-empty
destination directory. I think we should add this as an optimized path to
FileOutputFormat, but keep the original, iterative rename scheme if the output
directory isn't empty for backwards compatibility.
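To make the two caveats concrete, here is a rough sketch of the "fast path
plus fallback" idea. It is only an illustration under stated assumptions: it
uses java.nio.file on a local filesystem as a stand-in for Hadoop's FileSystem
API, and CommitSketch, commit, isAbsentOrEmpty and mergeInto are hypothetical
names, not code from FileOutputCommitter or from any patch on this issue.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; local-filesystem stand-in for the HDFS case.
class CommitSketch {

    /** Commits pending output into outputDir; returns true if the single-rename path was used. */
    public static boolean commit(Path pending, Path outputDir) throws IOException {
        Path parent = outputDir.toAbsolutePath().getParent();
        // Fast path needs write permission on the parent (first caveat) and an
        // absent or empty destination (second caveat).
        if (parent != null && Files.isWritable(parent) && isAbsentOrEmpty(outputDir)) {
            Files.deleteIfExists(outputDir); // drop the empty placeholder, if any
            Files.move(pending, outputDir, StandardCopyOption.ATOMIC_MOVE);
            return true;
        }
        // Fallback: original-style iterative rename into the existing directory.
        Files.createDirectories(outputDir);
        mergeInto(pending, outputDir);
        return false;
    }

    /** True if dir does not exist, or is a directory with no entries. */
    static boolean isAbsentOrEmpty(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return true;
        }
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            return !entries.iterator().hasNext();
        }
    }

    /** Moves every entry of src into dest one at a time, merging directories that exist on both sides. */
    static void mergeInto(Path src, Path dest) throws IOException {
        // Snapshot the listing first so we do not mutate src while iterating it.
        List<Path> children = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(src)) {
            for (Path child : entries) {
                children.add(child);
            }
        }
        for (Path child : children) {
            Path target = dest.resolve(child.getFileName());
            if (Files.isDirectory(child) && Files.isDirectory(target)) {
                mergeInto(child, target);
            } else {
                Files.move(child, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
        Files.delete(src); // src is empty once all children have been moved
    }
}
```

With this shape, a job whose output format skips the "outputDir must not
exist" check still commits through the slower per-entry path, while the common
case of a fresh output directory costs a single rename. Whether the fallback
should also apply when the parent directory is not writable, as the sketch
does, is an assumption on my part.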
> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> ------------------------------------------------------------------------------
>
> Key: MAPREDUCE-4815
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Jason Lowe
> Assignee: Bikas Saha
>
> If a job generates many files to commit then the commitJob method call at the
> end of the job can take minutes. This is a performance regression from 1.x,
> as 1.x had the tasks commit directly to the final output directory as they
> were completing and commitJob had very little to do. The commit work was
> processed in parallel and overlapped the processing of outstanding tasks. In
> 0.23/2.x, the commit is single-threaded and waits until all tasks have
> completed before commencing.