[ 
https://issues.apache.org/jira/browse/TEZ-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229416#comment-15229416
 ] 

Rajesh Balamohan commented on TEZ-3198:
---------------------------------------

Thanks [~jlowe] for sharing the details. Since the response was a "500 Internal 
Server Error", the fetch would not have gone through any of the connect/read 
timeouts and would have had to rely on host penalization delays. However, 
within 8 retries the fetcher was marked unhealthy ("fetcherHealthy=false, 
failedShufflesSinceLastCompletion=8, remainingMaps=2"), because 
"tez.runtime.shuffle.max.allowed.failed.fetch.fraction" defaults to 0.5. This 
setting determines whether the fetcher is marked unhealthy based on the ratio 
failures / (failures + remainingMaps), which here is 8 / (8 + 2) = 0.8. Please 
note that "failures" here means the failure count since the last successful 
download.
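
As a rough illustration of the ratio check described above, here is a minimal 
sketch. The class and method names are hypothetical and this is not the actual 
Tez code; it also assumes the fetcher counts as healthy only while the ratio 
stays strictly below the configured fraction.

```java
// Hypothetical sketch of the health check; not the actual Tez implementation.
final class FetcherHealthSketch {

    // Ratio from the comment: failures since the last successful download,
    // divided by (those failures + remaining inputs).
    static float failureFraction(int failedSinceLastCompletion, int remainingMaps) {
        int denominator = failedSinceLastCompletion + remainingMaps;
        if (denominator == 0) {
            return 0.0f; // no failures and nothing left to fetch
        }
        return (float) failedSinceLastCompletion / denominator;
    }

    // Assumption: healthy means the ratio is strictly below the threshold
    // (tez.runtime.shuffle.max.allowed.failed.fetch.fraction, default 0.5).
    static boolean isHealthy(int failedSinceLastCompletion, int remainingMaps,
                             float maxAllowedFailedFetchFraction) {
        return failureFraction(failedSinceLastCompletion, remainingMaps)
                < maxAllowedFailedFetchFraction;
    }

    public static void main(String[] args) {
        // Values from the log line quoted above: 8 failures, 2 remaining maps.
        System.out.println(isHealthy(8, 2, 0.5f)); // prints "false"
    }
}
```

With only 2 inputs remaining, even a couple of consecutive failures pushes the 
ratio to the threshold, which is why the trailing task is so exposed.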

So within 8 failures of the last successful download, the shuffle task was 
marked unhealthy, and the same pattern would have repeated for each of the 
retried attempts as well.

In this case, setting 
"tez.runtime.shuffle.failed.check.since-last.completion=false" would help, as 
the fetcher would continue to run and report failures to the AM (and the source 
task would be restarted by the AM after a timeout).


> Shuffle failures for the trailing task in a vertex are often fatal to the 
> entire DAG
> ------------------------------------------------------------------------------------
>
>                 Key: TEZ-3198
>                 URL: https://issues.apache.org/jira/browse/TEZ-3198
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0, 0.8.2
>            Reporter: Jason Lowe
>            Priority: Critical
>
> I've seen an increasing number of cases where a single-node failure caused 
> the whole Tez DAG to fail. These scenarios are common in that they involve 
> the last task of a vertex attempting to complete a shuffle where all the peer 
> tasks have already finished shuffling.  The last task's attempt encounters 
> errors shuffling one of its inputs and keeps reporting it to the AM.  
> Eventually the attempt decides it must be the cause of the shuffle error and 
> fails.  The subsequent attempts all do the same thing, and eventually we hit 
> the task max attempts limit and fail the vertex and DAG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
