[
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504319#comment-13504319
]
Vinod Kumar Vavilapalli commented on MAPREDUCE-4815:
----------------------------------------------------
The original code to do merging of directories was done with recovery in mind,
but overlooked the performance implications. Here's a concrete proposal:
Assuming output directory is $parentDir/$outputDir
- On job submission, OutputFormat.checkOutputSpecs() will verify that
$outputDir doesn't already exist.
- OutputCommitter initialization(constructor) will create $outputDir if it
doesn't exist
- setupJob() will create
{noformat}$parentDir/_tempJobOutput_$applicationAttemptID/{noformat} and
{noformat}$parentDir/_tempJobOutput_$applicationAttemptID_tempTaskOutputs/{noformat}
- setupTask() or on-demand file creation by task will create
{noformat}$parentDir/_tempJobOutput_$applicationAttemptID_tempTaskOutputs/$taskAttemptID{noformat}
- commitTask() will move
{noformat}$parentDir/_tempJobOutput_$applicationAttemptID_tempTaskOutputs/$taskAttemptID{noformat}
to
{noformat}$parentDir/_tempJobOutput_$applicationAttemptID/$taskAttemptID{noformat}
- commitJob() will move
{noformat}$parentDir/_tempJobOutput_$applicationAttemptID/{noformat} to
{{$outputDir}}
- recoverJob() also will move
{noformat}$parentDir/_tempJobOutput_$applicationAttemptID{noformat} to
{noformat}$parentDir/_tempJobOutput_$recoveringApplicationAttemptID{noformat}
So, in sum per application-attempt, two top-level directories. No more per-task
file-moves, everything is a atomic rename.
Thoughts?
> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> ------------------------------------------------------------------------------
>
> Key: MAPREDUCE-4815
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Jason Lowe
> Assignee: Bikas Saha
>
> If a job generates many files to commit then the commitJob method call at the
> end of the job can take minutes. This is a performance regression from 1.x,
> as 1.x had the tasks commit directly to the final output directory as they
> were completing and commitJob had very little to do. The commit work was
> processed in parallel and overlapped the processing of outstanding tasks. In
> 0.23/2.x, the commit is single-threaded and waits until all tasks have
> completed before commencing.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira