[
https://issues.apache.org/jira/browse/TEZ-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230263#comment-15230263
]
Jason Lowe commented on TEZ-3198:
---------------------------------
Note that this is not limited to 500 internal server errors. We've also seen
this failure when a nodemanager lost the shuffle secret (it was accidentally
rebooted without recovering the NM state store). The NM quickly rejected the
request because it didn't recognize the shuffle secret, and the job failed in
a similar manner. If we're relying on connect/read timeouts to reach the read
error duration threshold, then I assume a crashed nodemanager would also
trigger this problem. A crashed nodemanager is going to return a connection
refused error very quickly, so the task won't spend much time between
retries.
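
To make the timing argument concrete, here is a rough back-of-the-envelope
sketch. It is not Tez code: the function names, the 300 s threshold, the 180 s
timeout, the 10 ms refusal latency, and the 1 s retry delay are all
illustrative assumptions. It only shows how slowly elapsed time accumulates
when each fetch fails fast instead of waiting out a timeout.

```python
import math


def elapsed_after_retries(failure_latency_s, retry_delay_s, attempts):
    """Wall-clock time consumed by `attempts` failed fetches (simplified model)."""
    return attempts * (failure_latency_s + retry_delay_s)


def attempts_to_reach(threshold_s, failure_latency_s, retry_delay_s):
    """Smallest number of failed fetches whose elapsed time meets the threshold."""
    per_attempt_s = failure_latency_s + retry_delay_s
    if per_attempt_s <= 0:
        raise ValueError("each attempt must take a positive amount of time")
    return math.ceil(threshold_s / per_attempt_s)


# Hypothetical numbers, chosen only for illustration.
THRESHOLD_S = 300        # assumed duration threshold
TIMEOUT_FAILURE_S = 180  # assumed connect/read timeout
REFUSED_FAILURE_S = 0.01 # connection refused or rejected secret returns almost at once
RETRY_DELAY_S = 1        # assumed pause between retries

slow = attempts_to_reach(THRESHOLD_S, TIMEOUT_FAILURE_S, RETRY_DELAY_S)
fast = attempts_to_reach(THRESHOLD_S, REFUSED_FAILURE_S, RETRY_DELAY_S)
print(f"timeout-style failures: {slow} attempts to reach the threshold")
print(f"fast-fail failures:     {fast} attempts to reach the threshold")
```

Under these assumed numbers, timeout-style failures reach the duration
threshold after a couple of attempts, while fast failures need hundreds, so
any limit based on the number of failures would be hit long before a limit
based on elapsed time.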
> Shuffle failures for the trailing task in a vertex are often fatal to the
> entire DAG
> ------------------------------------------------------------------------------------
>
> Key: TEZ-3198
> URL: https://issues.apache.org/jira/browse/TEZ-3198
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0, 0.8.2
> Reporter: Jason Lowe
> Priority: Critical
>
> I've seen an increasing number of cases where a single-node failure caused
> the whole Tez DAG to fail. These scenarios have in common that they involve
> the last task of a vertex attempting to complete a shuffle where all the peer
> tasks have already finished shuffling. The last task's attempt encounters
> errors shuffling one of its inputs and keeps reporting them to the AM.
> Eventually the attempt decides it must be the cause of the shuffle error and
> fails. The subsequent attempts all do the same thing, and eventually we hit
> the task max attempts limit and fail the vertex and DAG.