[
https://issues.apache.org/jira/browse/TEZ-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227422#comment-15227422
]
Rajesh Balamohan commented on TEZ-3198:
---------------------------------------
[~jlowe] - Would it be possible to share the task logs for the uncompleted task
and the AM log? Also, can you please try the job with
"tez.runtime.shuffle.failed.check.since-last.completion=false"?
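
For reference, a minimal sketch of one way to pass that override on the command line. It assumes the job launcher accepts generic "-Dkey=value" options, which depends on how the job is submitted. The class and helper names are hypothetical; only the property name comes from the comment above. In a real job the same key would normally be set on the job's own configuration object instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects config overrides and renders them as -D arguments.
public class ShuffleCheckOverride {
    // Property name quoted in the comment above.
    static final String KEY = "tez.runtime.shuffle.failed.check.since-last.completion";

    // Render each override as -Dkey=value, separated by single spaces.
    // Assumption: the launcher understands generic -D options.
    static String toDashD(Map<String, String> overrides) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : overrides.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("-D").append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is deterministic.
        Map<String, String> overrides = new LinkedHashMap<>();
        overrides.put(KEY, "false");
        System.out.println(toDashD(overrides));
    }
}
```

The printed string can be appended to the job's launch command; whether the override changes the failure pattern is exactly what the test run is meant to show.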
> Shuffle failures for the trailing task in a vertex are often fatal to the
> entire DAG
> ------------------------------------------------------------------------------------
>
> Key: TEZ-3198
> URL: https://issues.apache.org/jira/browse/TEZ-3198
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0, 0.8.2
> Reporter: Jason Lowe
> Priority: Critical
> Fix For: 0.7.1, 0.8.3
>
>
> I've seen an increasing number of cases where a single-node failure caused
> the whole Tez DAG to fail. What these scenarios have in common is that
> the last task of a vertex attempts to complete a shuffle after all the peer
> tasks have already finished shuffling. The last task's attempt encounters
> errors shuffling one of its inputs and keeps reporting it to the AM.
> Eventually the attempt decides it must be the cause of the shuffle error and
> fails. The subsequent attempts all do the same thing, and eventually we hit
> the task max attempts limit and fail the vertex and DAG.