[jira] [Commented] (TEZ-3198) Shuffle failures for the trailing task in a vertex are often fatal to the entire DAG

Rajesh Balamohan (JIRA) Thu, 07 Apr 2016 07:42:45 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230335#comment-15230335
 ]


Rajesh Balamohan commented on TEZ-3198:
---------------------------------------

That is correct. I took the 500 internal server error shown in the logs as an 
example. In all of these cases, it would have to rely on the penalized host 
delays. It would be good to have a better value for 
"tez.runtime.shuffle.max.allowed.failed.fetch.fraction" (which currently 
defaults to 0.5); setting it to 0.95 should help as well. Setting 
"tez.runtime.shuffle.failed.check.since-last.completion=false" should disable 
this codepath directly and help validate that the DAGs are able to complete 
successfully. 
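
For anyone trying this out, the two overrides can be written down as a small 
sketch. The property names and values are the ones mentioned above; the 
ShuffleOverrides class and its methods are hypothetical helpers, not part of 
Tez, and the sketch assumes the settings are supplied as Hadoop-style 
<property> entries (for example in tez-site.xml). How they actually reach the 
job depends on the deployment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Tez): collects the two overrides suggested
// in this comment and renders them as Hadoop-style <property> elements.
class ShuffleOverrides {

    // Property names and values are taken from the comment; nothing else is set.
    static Map<String, String> overrides() {
        Map<String, String> conf = new LinkedHashMap<>();
        // Raise the allowed failed-fetch fraction from the 0.5 default.
        conf.put("tez.runtime.shuffle.max.allowed.failed.fetch.fraction", "0.95");
        // Turn off the "failures since last completion" check.
        conf.put("tez.runtime.shuffle.failed.check.since-last.completion", "false");
        return conf;
    }

    // Render each entry as a <property> block, in insertion order.
    static String toSiteXml(Map<String, String> conf) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            sb.append("<property>\n")
              .append("  <name>").append(e.getKey()).append("</name>\n")
              .append("  <value>").append(e.getValue()).append("</value>\n")
              .append("</property>\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toSiteXml(overrides()));
    }
}
```

Since the second setting is suggested mainly as a way to validate that the DAGs 
can complete, it is probably best treated as a diagnostic override rather than 
a permanent default.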

> Shuffle failures for the trailing task in a vertex are often fatal to the 
> entire DAG
> ------------------------------------------------------------------------------------
>
>                 Key: TEZ-3198
>                 URL: https://issues.apache.org/jira/browse/TEZ-3198
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0, 0.8.2
>            Reporter: Jason Lowe
>            Priority: Critical
>
> I've seen an increasing number of cases where a single-node failure caused 
> the whole Tez DAG to fail. These scenarios are common in that they involve 
> the last task of a vertex attempting to complete a shuffle where all the peer 
> tasks have already finished shuffling.  The last task's attempt encounters 
> errors shuffling one of its inputs and keeps reporting it to the AM.  
> Eventually the attempt decides it must be the cause of the shuffle error and 
> fails.  The subsequent attempts all do the same thing, and eventually we hit 
> the task max attempts limit and fail the vertex and DAG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
