[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504319#comment-13504319
 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-4815:
----------------------------------------------------

The original directory-merging code was written with recovery in mind, 
but overlooked the performance implications. Here's a concrete proposal:

Assuming the output directory is $parentDir/$outputDir:
 - On job submission, OutputFormat.checkOutputSpecs() will verify that 
$outputDir doesn't already exist.
 - OutputCommitter initialization (the constructor) will create $outputDir if it 
doesn't exist
 - setupJob() will create 
{noformat}$parentDir/_tempJobOutput_$applicationAttemptID/{noformat} and 
{noformat}$parentDir/_tempJobOutput_$applicationAttemptID_tempTaskOutputs/{noformat}
 - setupTask() or on-demand file creation by task will create 
{noformat}$parentDir/_tempJobOutput_$applicationAttemptID_tempTaskOutputs/$taskAttemptID{noformat}
 - commitTask() will move 
{noformat}$parentDir/_tempJobOutput_$applicationAttemptID_tempTaskOutputs/$taskAttemptID{noformat}
 to 
{noformat}$parentDir/_tempJobOutput_$applicationAttemptID/$taskAttemptID{noformat}
 - commitJob() will move 
{noformat}$parentDir/_tempJobOutput_$applicationAttemptID/{noformat} to 
{{$outputDir}}
 - recoverJob() will also move 
{noformat}$parentDir/_tempJobOutput_$applicationAttemptID{noformat} to 
{noformat}$parentDir/_tempJobOutput_$recoveringApplicationAttemptID{noformat}

So, in sum, each application attempt uses two top-level directories. No more 
per-task file moves; everything is an atomic rename.
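
To make the directory flow concrete, here is a minimal sketch of the setupJob/setupTask/commitTask/commitJob steps on a local filesystem using java.nio.file. It is not Hadoop's FileOutputCommitter and does not use the Hadoop FileSystem API; the class name RenameCommitSketch, the runDemo() helper, and the sample attempt IDs are all hypothetical. As simplifications, the sketch does not pre-create $outputDir (renaming onto an existing directory behaves differently across filesystems) and leaves out recoverJob() and cleanup.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical local-filesystem sketch of the proposed layout; not Hadoop code. */
class RenameCommitSketch {
    final Path outputDir; // $parentDir/$outputDir
    final Path jobTemp;   // $parentDir/_tempJobOutput_$applicationAttemptID
    final Path taskTemp;  // $parentDir/_tempJobOutput_$applicationAttemptID_tempTaskOutputs

    RenameCommitSketch(Path parentDir, String outputName, String appAttemptId) {
        this.outputDir = parentDir.resolve(outputName);
        this.jobTemp = parentDir.resolve("_tempJobOutput_" + appAttemptId);
        this.taskTemp = parentDir.resolve("_tempJobOutput_" + appAttemptId + "_tempTaskOutputs");
    }

    /** Creates the two top-level directories for this application attempt. */
    void setupJob() throws IOException {
        Files.createDirectories(jobTemp);
        Files.createDirectories(taskTemp);
    }

    /** Creates and returns the working directory for one task attempt. */
    Path setupTask(String taskAttemptId) throws IOException {
        return Files.createDirectories(taskTemp.resolve(taskAttemptId));
    }

    /** Promotes a task attempt's directory with a single directory rename. */
    void commitTask(String taskAttemptId) throws IOException {
        Files.move(taskTemp.resolve(taskAttemptId), jobTemp.resolve(taskAttemptId),
                StandardCopyOption.ATOMIC_MOVE);
    }

    /** Promotes the whole job with one rename, regardless of how many files exist. */
    void commitJob() throws IOException {
        Files.move(jobTemp, outputDir, StandardCopyOption.ATOMIC_MOVE);
    }

    /** Runs two made-up task attempts end to end and returns the final output directory. */
    static Path runDemo(Path parentDir) {
        try {
            RenameCommitSketch c = new RenameCommitSketch(parentDir, "out", "appattempt_1");
            c.setupJob();
            for (String task : new String[] {"attempt_m_0", "attempt_m_1"}) {
                Path work = c.setupTask(task);
                Files.write(work.resolve("part-00000"), task.getBytes(StandardCharsets.UTF_8));
                c.commitTask(task);
            }
            c.commitJob();
            return c.outputDir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = runDemo(Files.createTempDirectory("commit-sketch"));
        System.out.println("committed to " + out);
    }
}
```

In this sketch commitJob() costs one rename no matter how many files the tasks wrote, which is the property the proposal is after. Whether the same holds on HDFS depends on its rename semantics, which the sketch does not model.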

Thoughts?
                
> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4815
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>
> If a job generates many files to commit then the commitJob method call at the 
> end of the job can take minutes.  This is a performance regression from 1.x, 
> as 1.x had the tasks commit directly to the final output directory as they 
> were completing and commitJob had very little to do.  The commit work was 
> processed in parallel and overlapped the processing of outstanding tasks.  In 
> 0.23/2.x, the commit is single-threaded and waits until all tasks have 
> completed before commencing.

