[jira] [Commented] (HADOOP-15782) Clarify committers.md around v2 failure handling

Steve Loughran (JIRA) Fri, 28 Sep 2018 02:36:10 -0700


    [ https://issues.apache.org/jira/browse/HADOOP-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631607#comment-16631607 ]


Steve Loughran commented on HADOOP-15782:
-----------------------------------------

bq. Only one attempt is allowed to commit via canCommit and with deterministic 
output + atomic rename works anyway.

Yes, but if that attempt is considered to have failed, an alternative attempt 
may commit. At least, that's my reading of the code in Hadoop and Spark. Now, 
provided that only one file is generated by a task attempt, and the output of 
any attempt can be accepted, then a long-delayed task commit *shouldn't* be 
harmful, at least if it happens while the job is in progress. If it happens 
after the job has completed, well, that's "unusual".
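
As a rough illustration of that reasoning, here is a toy Python model of 
commit arbitration. It is not the Hadoop or Spark code: the class, the method 
names and the dict standing in for the destination directory are all 
hypothetical, and the "rename" is simply a dict assignment.

```python
# Toy model of task-commit arbitration; every name here is hypothetical.
class CommitCoordinator:
    """Grants commit permission to one attempt per task at a time."""

    def __init__(self):
        self.authorized = {}  # task_id -> attempt_id currently allowed to commit

    def can_commit(self, task_id, attempt_id):
        # The first attempt to ask wins; other attempts are refused.
        holder = self.authorized.setdefault(task_id, attempt_id)
        return holder == attempt_id

    def attempt_failed(self, task_id, attempt_id):
        # If the authorized attempt is *considered* failed (e.g. it timed out),
        # the slot is released so that another attempt may commit.
        if self.authorized.get(task_id) == attempt_id:
            del self.authorized[task_id]


def commit_task(dest, task_id, data):
    # Stand-in for an atomic rename of the single file a task attempt produces.
    dest[f"part-{task_id:05d}"] = data


coordinator = CommitCoordinator()
dest = {}

assert coordinator.can_commit(0, "attempt_0")      # attempt_0 is authorized
assert not coordinator.can_commit(0, "attempt_1")  # attempt_1 is refused

coordinator.attempt_failed(0, "attempt_0")         # attempt_0 is presumed dead
assert coordinator.can_commit(0, "attempt_1")      # attempt_1 is now authorized
commit_task(dest, 0, b"deterministic output")

# attempt_0 was only partitioned, not dead: its delayed commit arrives late.
commit_task(dest, 0, b"deterministic output")
```

In this model the late commit is harmless only because each attempt writes one 
file with the same content. With several files per attempt, or 
non-deterministic output, the destination could end up as a mix of both 
attempts.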

I've never seen that happen, but it is a failure mode to consider if you 
want to be able to show that your algorithm is robust. The MR job committer 
explicitly checks before job commit that it has had a recent heartbeat with the 
YARN RM, to avoid this and cluster partition problems at the job level; nothing 
checks for it at the level of individual tasks.
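
The job-level guard can be sketched as a simple recency test. This is only a 
minimal illustration of the idea, not the MR implementation: the function name, 
its parameters and the use of a monotonic clock are assumptions.

```python
import time


def safe_to_commit_job(last_heartbeat, commit_window, now=None):
    """Return True only if the last successful heartbeat is recent enough.

    last_heartbeat and now are timestamps in seconds from the same clock;
    commit_window is the maximum tolerated age of the heartbeat in seconds.
    All three names are hypothetical.
    """
    if now is None:
        now = time.monotonic()
    return (now - last_heartbeat) <= commit_window
```

A committer that fails this test would delay or abandon job commit rather than 
risk committing while partitioned from the resource manager.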

bq. even on HDFS v1 was noticeably non-atomic for large jobs and you need to 
check for _SUCCESS or have another service recording completion, v2 was a big 
improvement for Twitter.

Yes: each committer has phases of its operation that are not atomic.
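
Since job commit is not atomic, a consumer can guard against reading 
half-committed output by testing for the `_SUCCESS` marker mentioned in the 
quote. A minimal sketch against a local directory follows; the helper names are 
hypothetical, and a real consumer would go through the Hadoop FileSystem API 
rather than `pathlib`.

```python
import tempfile
from pathlib import Path


def job_output_is_complete(output_dir):
    # The marker is written only after job commit has finished.
    return (Path(output_dir) / "_SUCCESS").is_file()


def list_committed_parts(output_dir):
    # Refuse to list output that may still be mid-commit.
    if not job_output_is_complete(output_dir):
        raise RuntimeError(f"no _SUCCESS marker in {output_dir}")
    return sorted(p.name for p in Path(output_dir).glob("part-*"))


with tempfile.TemporaryDirectory() as d:
    (Path(d) / "part-00000").write_text("row\n")
    print(job_output_is_complete(d))   # marker not yet written
    (Path(d) / "_SUCCESS").touch()
    print(list_committed_parts(d))     # marker present, safe to read
```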

The S3A committers don't have atomic job commit either: job commit is O(files) 
POST requests, which can be issued in parallel. They do at least deliver atomic 
task commit (or so I believe).
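
To make the "O(files) POST requests in parallel" point concrete, here is a 
sketch of what such a job commit looks like in outline. It is not the S3A 
code: `commit_job` and the `complete_upload` callback are hypothetical, with 
the callback standing in for the one request per file that makes a pending 
upload visible.

```python
from concurrent.futures import ThreadPoolExecutor


def commit_job(pending_uploads, complete_upload, max_workers=8):
    """Complete every pending upload, issuing the requests in parallel.

    complete_upload is a hypothetical callback that performs one request
    for one file. Files become visible one by one, so a failure part-way
    through leaves some of the output visible: the commit is not atomic.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(complete_upload, u) for u in pending_uploads]
        # result() re-raises the first failure, in submission order.
        return [f.result() for f in futures]
```

The total work is still one request per file, but the wall-clock time shrinks 
with the number of workers. Nothing in this outline rolls back the uploads that 
were completed before a failure.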


bq. I am just about to familiarize myself with Spark's use of FOC 

It doesn't need to worry about job restart, so life is simpler. It still uses 
the MRv1 APIs though, which they should be weaned off (and which are deprecated 
in Hadoop MR).




> Clarify committers.md around v2 failure handling
> ------------------------------------------------
>
>                 Key: HADOOP-15782
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15782
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 3.1.0, 3.1.1
>            Reporter: Gera Shegalov
>            Priority: Major
>
> The doc file 
> {{hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md}} 
> refers to the default file output committer (v2) as not supporting job and 
> task recovery throughout the doc:
> {quote}or just by rerunning everything (The "v2" algorithm and Spark).
> {quote}
> This is incorrect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

