[
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271115#comment-14271115
]
Jason Lowe commented on MAPREDUCE-4815:
---------------------------------------
Thanks for the design proposal, Gera. I think that should work, and I'm a lot
more comfortable with an approach that doesn't use an alternate top-level
directory, which would likely break existing derived committers. However, the
proposed design has some slight incompatibilities with current behavior:
- If a job crashes and fails to clean up properly, partial output can appear in
the output directory. This is true today as well if the AM crashes in the
middle of the rename process, but the window with the new design is much, much
wider since files appear in the output directory as tasks complete rather than
in a burst at the end of the job.
- The resolution of output filename collisions between tasks is deterministic
today but would be dependent upon task completion order with the proposed
design. commitTask would need to be tolerant of renames to existing files due
to this collision possibility. This can also happen in the non-recovery case
if the previous attempt crashed during a multi-file commit, since some of the
output files are already there. abortTask doesn't know enough about what is
being committed to clean up after the commit and avoid this.
- If a task attempt emits non-deterministic filenames then we could end up
generating extra output after a recovery since we don't know which files in the
output directory correspond to the previous task attempt if the attempt crashed
mid-commit.
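
To illustrate the second point, here is a minimal sketch of a commitTask that
tolerates destination files that already exist. It uses java.nio.file on a
local filesystem as a stand-in for the Hadoop FileSystem API, and the class
name TolerantTaskCommit and the "replace the existing file" policy are
assumptions made for illustration, not the behavior of FileOutputCommitter or
of any attached patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch; a local-filesystem stand-in, not Hadoop code. */
final class TolerantTaskCommit {
    private TolerantTaskCommit() {}

    /**
     * Moves every regular file directly under taskDir into outputDir.
     * A destination that already exists (from another task, or from an
     * earlier attempt that crashed mid-commit) is replaced rather than
     * treated as an error, so the last committer wins.
     *
     * @return the number of files moved
     */
    static int commitTask(Path taskDir, Path outputDir) throws IOException {
        Files.createDirectories(outputDir);

        // Snapshot the listing first so the directory is not modified
        // while it is still being iterated.
        List<Path> sources;
        try (Stream<Path> listing = Files.list(taskDir)) {
            sources = listing.filter(Files::isRegularFile)
                             .collect(Collectors.toList());
        }

        for (Path src : sources) {
            Path dst = outputDir.resolve(src.getFileName());
            Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
        }
        return sources.size();
    }
}
```

With this policy the winner of a filename collision is whichever task commits
last, which is exactly the completion-order dependence described above.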
IIRC Hadoop 0.20/1.x also committed output files straight from tasks into the
output directory and therefore would have the same issues as the first two
above. The first is mitigated by the _SUCCESS file, and consumers of the
output should be sanity-checking that it exists before assuming all the data is
there. The second is mitigated by a "don't do that" approach or ensuring that
the file contents are identical so it doesn't matter who wins.
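
A consumer-side check for the _SUCCESS marker might look roughly like the
sketch below. It uses java.nio.file on a local path as a stand-in for the
Hadoop FileSystem API; the class and method names are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical consumer-side guard; not part of Hadoop. */
final class SuccessMarkerCheck {
    /** Marker file name mentioned above. */
    static final String SUCCESS_MARKER = "_SUCCESS";

    private SuccessMarkerCheck() {}

    /**
     * Returns true only if the job output directory contains the marker
     * file, i.e. the job finished its commit. Partial output left by a
     * crashed job has data files but no marker.
     */
    static boolean isJobOutputComplete(Path outputDir) {
        return Files.isRegularFile(outputDir.resolve(SUCCESS_MARKER));
    }
}
```

A reader would call isJobOutputComplete(outputDir) before listing any part
files and skip or fail the read when it returns false.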
The third issue is probably not a show stopper, since MapReduce doesn't work
that well with non-deterministic tasks in general (e.g., speculation and
reattempts). However, we may want to implement this, at least initially, with a
config or protected method that allows jobs or derived committers to request
the original FileOutputCommitter behavior in case they need to preserve the
current behavior.
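
As a rough sketch of that opt-out, the following shows a config key combined
with a protected method that a derived committer can override. Everything
here is hypothetical: the key name, the class names, the strategy strings, and
the choice to default to the new behavior are assumptions for illustration,
and java.util.Properties stands in for the Hadoop Configuration class.

```java
import java.util.Properties;

/** Hypothetical committer showing a config/protected-method switch. */
class SketchCommitter {
    /** Hypothetical key; not an actual Hadoop configuration property. */
    static final String DIRECT_COMMIT_KEY = "sketch.committer.direct-task-commit";

    private final Properties conf;

    SketchCommitter(Properties conf) {
        this.conf = conf;
    }

    /**
     * True if tasks should commit straight into the job output directory.
     * Jobs can turn this off via the config key; derived committers can
     * override the method to pin the behavior they depend on.
     */
    protected boolean useDirectTaskCommit() {
        return Boolean.parseBoolean(conf.getProperty(DIRECT_COMMIT_KEY, "true"));
    }

    /** Names the strategy that would be used; a placeholder for real work. */
    final String commitStrategy() {
        return useDirectTaskCommit() ? "commit-in-task" : "merge-in-commitJob";
    }
}

/** A derived committer that always requests the original behavior. */
class LegacyBehaviorCommitter extends SketchCommitter {
    LegacyBehaviorCommitter(Properties conf) {
        super(conf);
    }

    @Override
    protected boolean useDirectTaskCommit() {
        return false;
    }
}
```

The override takes precedence over the config value, so a derived committer
that relies on the original layout keeps working regardless of job settings.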
About the task directory rename, I'm not sure it's strictly necessary. The AM
is tracking whether a task successfully completed in the jhist file, and a
subsequent app attempt won't try to recover the output of a task unless it has
received confirmation that the task completed and recorded that in the jhist
file. However, we could keep the directory rename as an additional indication
that the commit succeeded and the output was successfully placed in the output
directory. This allows the recoverTask method to report failure if it doesn't
find that directory, although one would wonder how the AM received confirmation
from the previous task that everything was OK and recorded that in the jhist.
I'm torn on whether to keep this sanity check, as removing it allows us to
eliminate another write operation on the namenode per task. Without it,
recoverTask becomes a no-op in the new design, since if the previous task
attempt completed then we know its output is already in the job output
directory and there's nothing else to do for that task.
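
The two variants of recoverTask discussed above can be sketched as follows.
This is a local-filesystem illustration using java.nio.file; the class name,
the method signature, and the verifyCommittedDir flag are hypothetical and do
not match the signature of the real recoverTask.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of the optional recovery sanity check. */
final class RecoverTaskSketch {
    private RecoverTaskSketch() {}

    /**
     * Recovers a task that the job history already records as completed.
     *
     * @param committedTaskDir   directory the task attempt would have been
     *                           renamed to after a successful commit
     * @param verifyCommittedDir whether to keep the rename-based sanity check
     */
    static void recoverTask(Path committedTaskDir, boolean verifyCommittedDir)
            throws IOException {
        if (!verifyCommittedDir) {
            // Without the check there is nothing to do: the output is
            // assumed to be in the job output directory already.
            return;
        }
        if (!Files.isDirectory(committedTaskDir)) {
            throw new IOException(
                "Committed task directory not found: " + committedTaskDir);
        }
    }
}
```

Dropping the check saves the per-task rename, at the cost of trusting the job
history record alone.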
> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> ------------------------------------------------------------------------------
>
> Key: MAPREDUCE-4815
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
> Reporter: Jason Lowe
> Assignee: Siqi Li
> Attachments: MAPREDUCE-4815.v3.patch, MAPREDUCE-4815.v4.patch,
> MAPREDUCE-4815.v5.patch, MAPREDUCE-4815.v6.patch, MAPREDUCE-4815.v7.patch,
> MAPREDUCE-4815.v8.patch
>
>
> If a job generates many files to commit then the commitJob method call at the
> end of the job can take minutes. This is a performance regression from 1.x,
> as 1.x had the tasks commit directly to the final output directory as they
> were completing and commitJob had very little to do. The commit work was
> processed in parallel and overlapped the processing of outstanding tasks. In
> 0.23/2.x, the commit is single-threaded and waits until all tasks have
> completed before commencing.