[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094819#comment-14094819
 ] 

Siqi Li commented on MAPREDUCE-4815:
------------------------------------

The approach I took is to merge the output of each task into a temporary
directory as soon as that task finishes.
Assume the output directory is $parentDir/$outputDir.

{code}
setupJob() will create
$parentDir/$outputDir_temporary/$attemptID
and
$parentDir/$outputDir_temporary/$attemptID_temporary

setupTask() or on-demand file creation by task will create
$parentDir/$outputDir_temporary/$attemptID_temporary/$taskAttemptID

commitTask() will move everything inside
$parentDir/$outputDir_temporary/$attemptID_temporary/$taskAttemptID
to
$parentDir/$outputDir_temporary/$attemptID

recoverJob() will also move
$parentDir/$outputDir_temporary/$previous_attemptID
to
$parentDir/$outputDir_temporary/$recovering_attemptID

If the output directory doesn't exist, commitJob() will simply move
$parentDir/$outputDir_temporary/$attemptID to $parentDir/$outputDir

If the output directory does exist, commitJob() will copy all files from
$parentDir/$outputDir_temporary/$attemptID to $parentDir/$outputDir
{code}
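
To make the flow above easier to trace, here is a minimal sketch of the same directory layout. It is not the patch itself: it assumes a local filesystem and uses java.nio.file instead of the Hadoop FileSystem API, it assumes task output is a flat set of files, and it leaves out recoverJob(). The class name StagedCommitSketch and all method signatures are hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical local-filesystem sketch of the staged commit layout (not the Hadoop API). */
public class StagedCommitSketch {
    private final Path outputDir;      // $parentDir/$outputDir
    private final Path jobAttemptDir;  // $parentDir/$outputDir_temporary/$attemptID
    private final Path taskStagingDir; // $parentDir/$outputDir_temporary/$attemptID_temporary

    public StagedCommitSketch(Path parentDir, String outputName, String attemptId) {
        this.outputDir = parentDir.resolve(outputName);
        Path tmpRoot = parentDir.resolve(outputName + "_temporary");
        this.jobAttemptDir = tmpRoot.resolve(attemptId);
        this.taskStagingDir = tmpRoot.resolve(attemptId + "_temporary");
    }

    /** Creates the merged-output directory and the per-task staging directory. */
    public void setupJob() throws IOException {
        Files.createDirectories(jobAttemptDir);
        Files.createDirectories(taskStagingDir);
    }

    /** Creates and returns the private working directory for one task attempt. */
    public Path setupTask(String taskAttemptId) throws IOException {
        return Files.createDirectories(taskStagingDir.resolve(taskAttemptId));
    }

    /** Moves everything a finished task wrote into the merged job-attempt directory. */
    public void commitTask(String taskAttemptId) throws IOException {
        Path taskDir = taskStagingDir.resolve(taskAttemptId);
        for (Path child : listChildren(taskDir)) {
            Files.move(child, jobAttemptDir.resolve(child.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
        Files.delete(taskDir); // empty once all children have been moved
    }

    /** One directory move if the output is absent; otherwise a file-by-file copy. */
    public void commitJob() throws IOException {
        if (Files.notExists(outputDir)) {
            Files.move(jobAttemptDir, outputDir);
        } else {
            // Assumption: flat files only; nested directories are not copied recursively.
            for (Path child : listChildren(jobAttemptDir)) {
                Files.copy(child, outputDir.resolve(child.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    /** Snapshots the directory listing so entries can be moved while looping. */
    private static List<Path> listChildren(Path dir) throws IOException {
        List<Path> children = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                children.add(p);
            }
        }
        return children;
    }
}
```

The point of the layout is that the per-file work happens in commitTask(), spread across the job as tasks finish, so in the common case where the output directory does not exist yet commitJob() reduces to a single directory move.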

> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4815
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
>            Reporter: Jason Lowe
>            Assignee: Siqi Li
>
> If a job generates many files to commit then the commitJob method call at the 
> end of the job can take minutes.  This is a performance regression from 1.x, 
> as 1.x had the tasks commit directly to the final output directory as they 
> were completing and commitJob had very little to do.  The commit work was 
> processed in parallel and overlapped the processing of outstanding tasks.  In 
> 0.23/2.x, the commit is single-threaded and waits until all tasks have 
> completed before commencing.



