[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364134#comment-14364134
 ] 

Gera Shegalov commented on MAPREDUCE-4815:
------------------------------------------

[~ivan.bella], thanks for reporting the problem. We need to capture it in a 
unit test. I chatted with [~l201514] offline, and I want to verify my 
understanding of the problem.

Your reducer i outputs some unique paths under 
{{$joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/dir}}

When multiple reducers, say 2, commit simultaneously before {{$joboutput/dir}} 
has been created, they both try to move: 
{code}$joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/dir -> 
$joboutput/dir{code}

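Assuming reducer 1 wins and creates {{$joboutput/dir}}, reducer 1's files end 
up under {{$joboutput/dir}}. Reducer 2's rename then targets a directory that 
already exists, so its source directory is moved inside it and reducer 2's 
files end up under {{$joboutput/dir/dir}}.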
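
To make the scenario concrete, here is a toy in-memory model in Python. It is 
not Hadoop code: {{ToyFs}}, {{simulate_race}}, and the {{/out}} and 
{{attempt_N}} names are made up for illustration, and the only rule it encodes 
is the assumption that renaming onto an existing directory moves the source 
inside that directory.

```python
import posixpath


class ToyFs:
    """Hypothetical in-memory file system, just enough to replay the race."""

    def __init__(self):
        self.files = set()
        self.dirs = {"/"}

    def create(self, path):
        # Register the file and every missing parent directory.
        self.files.add(path)
        parent = posixpath.dirname(path)
        while parent not in self.dirs:
            self.dirs.add(parent)
            parent = posixpath.dirname(parent)

    def rename(self, src, dst):
        # Assumed rule: an existing destination directory absorbs the source,
        # so the effective target becomes dst/<basename of src>.
        if dst in self.dirs:
            dst = posixpath.join(dst, posixpath.basename(src))

        def move(p):
            if p == src:
                return dst
            if p.startswith(src + "/"):
                return dst + p[len(src):]
            return p

        self.files = {move(f) for f in self.files}
        self.dirs = {move(d) for d in self.dirs}


def simulate_race():
    fs = ToyFs()
    # Stand-ins for $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/dir
    t1 = "/out/_temporary/1/_temporary/attempt_1/dir"
    t2 = "/out/_temporary/1/_temporary/attempt_2/dir"
    fs.create(t1 + "/part-1")
    fs.create(t2 + "/part-2")

    # Both reducers saw that /out/dir was missing, so both issue a plain rename.
    fs.rename(t1, "/out/dir")  # reducer 1 wins and creates /out/dir
    fs.rename(t2, "/out/dir")  # reducer 2 now lands inside /out/dir
    return sorted(fs.files)


if __name__ == "__main__":
    for path in simulate_race():
        print(path)
```

If the model matches the real behavior, a unit test for this issue would need 
to assert that no committed file ends up under a nested {{dir/dir}} path.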


> Speed up FileOutputCommitter#commitJob for many output files
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-4815
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
>            Reporter: Jason Lowe
>            Assignee: Siqi Li
>              Labels: perfomance
>             Fix For: 2.7.0
>
>         Attachments: MAPREDUCE-4815.v10.patch, MAPREDUCE-4815.v11.patch, 
> MAPREDUCE-4815.v12.patch, MAPREDUCE-4815.v13.patch, MAPREDUCE-4815.v14.patch, 
> MAPREDUCE-4815.v15.patch, MAPREDUCE-4815.v16.patch, MAPREDUCE-4815.v17.patch, 
> MAPREDUCE-4815.v3.patch, MAPREDUCE-4815.v4.patch, MAPREDUCE-4815.v5.patch, 
> MAPREDUCE-4815.v6.patch, MAPREDUCE-4815.v7.patch, MAPREDUCE-4815.v8.patch, 
> MAPREDUCE-4815.v9.patch
>
>
> If a job generates many files to commit then the commitJob method call at the 
> end of the job can take minutes.  This is a performance regression from 1.x, 
> as 1.x had the tasks commit directly to the final output directory as they 
> were completing and commitJob had very little to do.  The commit work was 
> processed in parallel and overlapped the processing of outstanding tasks.  In 
> 0.23/2.x, the commit is single-threaded and waits until all tasks have 
> completed before commencing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)