[ https://issues.apache.org/jira/browse/TEZ-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15226974#comment-15226974 ]
Jason Lowe commented on TEZ-3198:
---------------------------------
The issue is that the AM never gets around to re-running the upstream task.
Part of the issue is that errors reported from a task are aggregated by the
reporting task attempt, so no matter how often an attempt reports an error
against another task, it will only be recorded as a single failure when
calculating whether to re-run the task. As I understand it, the algorithm for
re-running a task due to fetch failures currently works like this:
- if it has been 5 minutes (default) since the first time the reporting attempt
flagged an error then re-run the blamed task
- if the number of unique reporting attempts is more than 10% (default) of the
total number of downstream tasks then re-run the blamed task
- if the total number of unique reporting attempts is >= 10 (default) then
re-run the blamed task
- otherwise do not re-run the blamed task
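
As a rough illustration, the rules above can be sketched as a single decision
function. This is not the Tez implementation: the function, class, and
parameter names are hypothetical, only the defaults (5 minutes, 10%, 10) come
from the description above, and whether the real time comparison is inclusive
is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchFailureThresholds:
    """Hypothetical container for the three defaults described above."""
    max_seconds_since_first_report: float = 300.0  # 5 minutes
    max_reporter_percent: int = 10                 # 10% of downstream tasks
    max_unique_reporters: int = 10                 # absolute reporter count


def should_rerun_blamed_task(seconds_since_first_report,
                             unique_reporters,
                             total_downstream_tasks,
                             thresholds=FetchFailureThresholds()):
    """Return True if any one of the three rules says to re-run the task.

    unique_reporters counts distinct reporting attempts, so repeated
    reports from the same attempt do not raise it.
    """
    # Rule 1: enough time has passed since the first report
    # (treating the boundary as inclusive is an assumption).
    if seconds_since_first_report >= thresholds.max_seconds_since_first_report:
        return True
    # Rule 2: more than 10% of downstream tasks have reported
    # (integer arithmetic avoids floating-point rounding at the boundary).
    if unique_reporters * 100 > thresholds.max_reporter_percent * total_downstream_tasks:
        return True
    # Rule 3: at least 10 unique reporters.
    if unique_reporters >= thresholds.max_unique_reporters:
        return True
    # Otherwise the blamed task is not re-run.
    return False
```

For example, should_rerun_blamed_task(60, 1, 100) is False under this sketch:
one reporter, one minute in, in a 100-task vertex satisfies none of the rules.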
Since there's only one task trying to shuffle, there are by default at most 4
attempts. This means the task alone cannot reach the threshold of 10 unique
reporting attempts, nor can it clear the 10% hurdle if there are a decent
number of tasks in its vertex. That leaves just the 5 minute period, but often
the task attempt gives up on its own due to the number of failures before
it gets past the 5 minute mark. This is especially true if the failure is fast,
such as connection refused, invalid shuffle secret, missing map output on the
node, etc.
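
To make the arithmetic concrete, here is a small sketch of which of the two
count-based rules a lone task could ever satisfy. It assumes the best case
where all 4 attempts are counted as unique reporters; the function name and
parameters are hypothetical, and the defaults are the ones quoted above.

```python
def reachable_rules(total_downstream_tasks,
                    max_attempts=4,      # default task attempt limit
                    percent=10,          # "more than 10%" rule
                    min_reporters=10):   # ">= 10 reporters" rule
    """Report which count-based rules a single task's attempts can meet."""
    return {
        # 4 attempts can never reach 10 unique reporters.
        "count": max_attempts >= min_reporters,
        # 4 attempts exceed 10% only when the vertex has fewer than 40 tasks.
        "percent": max_attempts * 100 > percent * total_downstream_tasks,
    }


small = reachable_rules(39)   # percent rule reachable, count rule not
large = reachable_rules(40)   # neither rule reachable
```

Under these assumptions, any vertex with 40 or more tasks leaves the 5 minute
timer as the only rule that can trigger a re-run.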
A typical scenario that can lead to this is a lone task needing to be re-run
due to shuffle errors. That lone task needs to shuffle again, long after its
peer tasks have completed, and in the interim a node has failed in some way
(not necessarily recognized by YARN yet). Now we're left with a lone task
trying to complete a shuffle that cannot succeed since the attempts always give
up before the AM decides to re-run the upstream task.
> Shuffle failures for the trailing task in a vertex are often fatal to the
> entire DAG
> ------------------------------------------------------------------------------------
>
> Key: TEZ-3198
> URL: https://issues.apache.org/jira/browse/TEZ-3198
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0, 0.8.2
> Reporter: Jason Lowe
> Priority: Critical
> Fix For: 0.7.1, 0.8.3
>
>
> I've seen an increasing number of cases where a single-node failure caused
> the whole Tez DAG to fail. These scenarios are alike in that they involve
> the last task of a vertex attempting to complete a shuffle where all the peer
> tasks have already finished shuffling. The last task's attempt encounters
> errors shuffling one of its inputs and keeps reporting it to the AM.
> Eventually the attempt decides it must be the cause of the shuffle error and
> fails. The subsequent attempts all do the same thing, and eventually we hit
> the task max attempts limit and fail the vertex and DAG.