[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271115#comment-14271115
 ] 

Jason Lowe commented on MAPREDUCE-4815:
---------------------------------------

Thanks for the design proposal, Gera.  I think that should work, and I'm a lot 
more comfortable with an approach that doesn't use an alternate top-level 
directory which will likely break existing, derived committers.  However the 
proposed design has some slight incompatibilities with how it behaves today:

- If a job crashes and fails to cleanup properly, partial output can appear in 
the output directory.  This is true today as well if the AM crashes in the 
middle of the rename process, but the window with the new design is much, much 
wider since files appear in the output directory as tasks complete rather than 
in a burst at the end of the job.
- The resolution of output filename collisions between tasks is deterministic 
today but would be dependent upon task completion order with the proposed 
design.  commitTask would need to be tolerant of renames to existing files due 
to this collision possibility.  This can also happen in the non-recovery case 
if the previous attempt crashed during a multi-file commit, since some of the 
output files are already there.  abortTask doesn't know enough about what is 
being committed to cleanup after the commit to avoid this.
- If a task attempt emits non-deterministic filenames then we could end up 
generating extra output after a recovery since we don't know which files in the 
output directory correspond to the previous task attempt if the attempt crashed 
mid-commit.

IIRC Hadoop 0.20/1.x also committed output files straight from tasks into the 
output directory and therefore would have the same issues as the first two 
above.  The first is mitigated by the _SUCCESS file, and consumers of the 
output should be sanity-checking that it exists before assuming all the data is 
there.  The second is mitigated by a "don't do that" approach or ensuring that 
the file contents are identical so it doesn't matter who wins.

The third issue is probably not a show stopper, since MapReduce doesn't work 
that well with non-deterministic tasks in general (e.g.: speculation and 
reattempts).  However we may want to implement this, at least initially, with a 
config or protected method that allows jobs or derived committers to request 
the original FileOutputCommitter behavior in case they need to preserve the 
current behavior.

About the task directory rename, I'm not sure it's strictly necessary.  The AM 
is tracking whether a task successfully completed in the jhist file, and a 
subsequent app attempt won't try to recover the output of a task unless it has 
received confirmation that the task completed and recorded that in the jhist 
file.  However we could keep the directory rename as an additional indication 
that the commit succeeded and the output was successfully placed in the output 
directory.  This allows the recoverTask method to report failure if it doesn't 
find that directory, although one would wonder how the AM received confirmation 
from the previous task that everything was OK and recorded that in the jhist.  
I'm torn on whether to keep this sanity check, as removing it allows us to 
eliminate another write operation on the namenode per task.  Without it 
recoverTask becomes a no-op in the new design, since if the previous task 
attempt completed then we know its output is already in the job output 
directory and there's nothing else to do for that task.

> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4815
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
>            Reporter: Jason Lowe
>            Assignee: Siqi Li
>         Attachments: MAPREDUCE-4815.v3.patch, MAPREDUCE-4815.v4.patch, 
> MAPREDUCE-4815.v5.patch, MAPREDUCE-4815.v6.patch, MAPREDUCE-4815.v7.patch, 
> MAPREDUCE-4815.v8.patch
>
>
> If a job generates many files to commit then the commitJob method call at the 
> end of the job can take minutes.  This is a performance regression from 1.x, 
> as 1.x had the tasks commit directly to the final output directory as they 
> were completing and commitJob had very little to do.  The commit work was 
> processed in parallel and overlapped the processing of outstanding tasks.  In 
> 0.23/2.x, the commit is single-threaded and waits until all tasks have 
> completed before commencing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to