[
https://issues.apache.org/jira/browse/HADOOP-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631607#comment-16631607
]
Steve Loughran commented on HADOOP-15782:
-----------------------------------------
bq. Only one attempt is allowed to commit via canCommit and with deterministic
output + atomic rename works anyway.
Yes, but if that attempt is considered to have failed, an alternative attempt
may commit. At least, that's my reading of the code in Hadoop and Spark. Now,
provided that only one file is generated by a task attempt, and the output of
any attempt can be accepted, then a long-delayed task commit *shouldn't* be
harmful, at least if it happens while the job is in progress. If it happens
after the job has completed, well, that's "unusual".
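To make the "deterministic output + atomic rename" argument concrete, here is a minimal sketch on a local filesystem. It is not the FileOutputCommitter code: {{commit_attempt}}, the {{_attempt_}} prefix and the file names are hypothetical, and {{os.replace}} stands in for the filesystem's atomic rename. It only shows that when every attempt produces the same single file, a late second commit leaves the destination unchanged.

```python
import os
import tempfile


def commit_attempt(work_dir: str, attempt_id: str, dest: str, data: bytes) -> None:
    """Write one attempt's output to a private file, then rename it into place."""
    # Hypothetical naming scheme for the attempt's private output file.
    attempt_path = os.path.join(work_dir, f"_attempt_{attempt_id}")
    with open(attempt_path, "wb") as f:
        f.write(data)
    # os.replace swaps the destination in one step when source and destination
    # are on the same filesystem: a reader sees the old file or the new one.
    os.replace(attempt_path, dest)


def run_demo() -> bytes:
    """Commit the same deterministic output twice and return the final content."""
    with tempfile.TemporaryDirectory() as work_dir:
        dest = os.path.join(work_dir, "part-00000")
        output = b"deterministic output\n"
        commit_attempt(work_dir, "1", dest, output)  # the replacement attempt commits
        commit_attempt(work_dir, "0", dest, output)  # the delayed attempt commits late
        with open(dest, "rb") as f:
            return f.read()
```

The sketch says nothing about attempts which write more than one file, or whose output differs between attempts; those are exactly the cases where a late commit could leave mixed results.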
I've never seen that happen, but it is a failure mode to consider if you
want to be able to show that your algorithm is robust. The MR job committer
explicitly checks before job commit that it's had a recent heartbeat with the
YARN RM to avoid this and cluster partition problems at the job level; nothing
worries about it for individual tasks.
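The heartbeat check described above can be sketched roughly as follows. This is an illustration of the idea only, not the MR application master's implementation: {{HeartbeatTracker}}, {{commit_job_if_connected}} and the {{max_age_seconds}} parameter are hypothetical names, and the real check talks to the YARN RM rather than reading a local timestamp.

```python
import time


class HeartbeatTracker:
    """Remembers when the last successful heartbeat was seen."""

    def __init__(self, max_age_seconds: float, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock  # injectable so the check can be tested
        self._last_heartbeat = None

    def record_heartbeat(self) -> None:
        self._last_heartbeat = self._clock()

    def is_recent(self) -> bool:
        # No heartbeat at all counts as "not recent".
        if self._last_heartbeat is None:
            return False
        return (self._clock() - self._last_heartbeat) <= self._max_age


def commit_job_if_connected(tracker: HeartbeatTracker, do_commit) -> bool:
    """Run do_commit only if a heartbeat was seen recently; report whether it ran."""
    if not tracker.is_recent():
        # Possibly partitioned from the cluster: do not commit.
        return False
    do_commit()
    return True
```

A check like this narrows the window rather than closing it: the process can still be partitioned between the check and the commit itself.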
bq. even on HDFS v1 was noticeably non-atomic for large jobs and you need to
check for _SUCCESS or have another service recording completion, v2 was a big
improvement for Twitter.
Yes: neither committer is atomic in every phase of its operation.
The S3A ones don't have atomic job commit either; just O(files) POST requests
which can be done in parallel. They do deliver atomic task commit (at
least I believe so...).
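Because job commit is not atomic, a consumer of the output can guard itself by checking for the {{_SUCCESS}} marker mentioned in the quoted text before reading anything. A minimal sketch against a local directory, with hypothetical helper names ({{is_job_output_complete}}, {{list_committed_parts}}); a real consumer would go through the Hadoop FileSystem API instead of {{pathlib}}:

```python
import tempfile
from pathlib import Path

SUCCESS_MARKER = "_SUCCESS"


def is_job_output_complete(output_dir: Path) -> bool:
    """True if the job-level success marker exists in the output directory."""
    return (output_dir / SUCCESS_MARKER).is_file()


def list_committed_parts(output_dir: Path) -> list:
    """Return the data files of a completed job, refusing partial output."""
    if not is_job_output_complete(output_dir):
        raise RuntimeError(f"{output_dir} has no {SUCCESS_MARKER}; job commit may be in progress")
    # Skip marker and hidden files such as _SUCCESS or .crc files.
    return sorted(
        p for p in output_dir.iterdir()
        if p.is_file() and not p.name.startswith(("_", "."))
    )
```

The marker only says that job commit finished; it does not help a reader which started listing the directory before the marker appeared.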
bq. I am just about to familiarize myself with Spark's use of FOC
It doesn't need to worry about job restart, so life is simpler. It still uses the
MRv1 APIs though, which they should be weaned off (and, in Hadoop MR, deprecated).
> Clarify committers.md around v2 failure handling
> ------------------------------------------------
>
> Key: HADOOP-15782
> URL: https://issues.apache.org/jira/browse/HADOOP-15782
> Project: Hadoop Common
> Issue Type: Bug
> Components: documentation
> Affects Versions: 3.1.0, 3.1.1
> Reporter: Gera Shegalov
> Priority: Major
>
> The doc file
> {{hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md}}
> refers to the default file output committer (v2) as not supporting job and
> task recovery throughout the doc:
> {quote}or just by rerunning everything (The "v2" algorithm and Spark).
> {quote}
> This is incorrect.