[ 
https://issues.apache.org/jira/browse/TEZ-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15226974#comment-15226974
 ] 

Jason Lowe commented on TEZ-3198:
---------------------------------

The issue is that the AM never gets around to re-running the upstream task.  
Part of the issue is that errors reported from a task are aggregated by the 
reporting task attempt, so no matter how often an attempt reports an error 
against another task, it will only be recorded as a single failure when 
calculating whether to re-run the task.  As I understand it, the algorithm for 
re-running a task due to fetch failures currently works like this:
- if it has been 5 minutes (default) since the first time the reporting attempt 
flagged an error then re-run the blamed task
- if the number of unique reporting attempts is more than 10% (default) of the 
total number of downstream tasks then re-run the blamed task
- if the total number of unique reporting attempts is >= 10 (default) then 
re-run the blamed task
- otherwise do not re-run the blamed task

Since there's only one task trying to shuffle, there is by default only 4 
attempts maximum.  This means the task cannot alone clear the total number of 
unique reporting attempts nor can it clear the 10% hurdle if there are a decent 
number of tasks in its vertex.  That leaves just the 5 minute period, but often 
the task attempt is giving up on its own due to the number of failures before 
it gets past the 5 minute mark.  This is especially true if the failure is fast 
such as connection refused, invalid shuffle secret, missing map output on the 
node, etc.

A typical scenario that can lead to this is a lone vertex needing to be re-run 
due to shuffle errors.  That lone task needs to shuffle again, long after its 
peer tasks have completed, and in the interim a node has failed in some way 
(not necessarily recognized by YARN yet).  Now we're left with a lone task 
trying to complete a shuffle that cannot succeed since the attempts always give 
up before the AM decides to re-run the upstream task.

> Shuffle failures for the trailing task in a vertex are often fatal to the 
> entire DAG
> ------------------------------------------------------------------------------------
>
>                 Key: TEZ-3198
>                 URL: https://issues.apache.org/jira/browse/TEZ-3198
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0, 0.8.2
>            Reporter: Jason Lowe
>            Priority: Critical
>             Fix For: 0.7.1, 0.8.3
>
>
> I've seen an increasing number of cases where a single-node failure caused 
> the whole Tez DAG to fail. These scenarios are common in that they involve 
> the last task of a vertex attempting to complete a shuffle where all the peer 
> tasks have already finished shuffling.  The last task's attempt encounters 
> errors shuffling one of its inputs and keeps reporting it to the AM.  
> Eventually the attempt decides it must be the cause of the shuffle error and 
> fails.  The subsequent attempts all do the same thing, and eventually we hit 
> the task max attempts limit and fail the vertex and DAG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to