Jason Lowe created TEZ-3198:
-------------------------------
Summary: Shuffle failures for the trailing task in a vertex are
often fatal to the entire DAG
Key: TEZ-3198
URL: https://issues.apache.org/jira/browse/TEZ-3198
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.8.2, 0.7.0
Reporter: Jason Lowe
Priority: Critical
Fix For: 0.7.1, 0.8.3
I've seen an increasing number of cases where a single-node failure caused the
whole Tez DAG to fail. These scenarios share a common pattern: the last task of
a vertex is attempting to complete a shuffle after all of its peer tasks have
already finished shuffling. The last task's attempt encounters errors
shuffling one of its inputs and keeps reporting them to the AM. Eventually the
attempt decides it must be the cause of the shuffle errors and fails. The
subsequent attempts all do the same thing, and eventually we hit the task max
attempts limit and fail the vertex and the DAG.
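The sequence above can be sketched as a toy model. This is only an
illustration of the reported behavior, not Tez code: the class name, the
three numeric knobs, and the idea that the AM counts reports per attempt are
all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the failure pattern described in this report; not Tez code. */
class TrailingTaskShuffleSketch {

    enum Outcome { VERTEX_SUCCEEDED, VERTEX_FAILED }

    /**
     * All three numeric parameters are hypothetical knobs of this model:
     *   maxTaskAttempts     - attempts allowed for the trailing task
     *   attemptFailureLimit - fetch errors an attempt tolerates before blaming itself
     *   amReportsNeeded     - reports (counted per attempt, an assumption) the AM
     *                         needs before it re-runs the upstream task
     */
    static Outcome simulate(int maxTaskAttempts, int attemptFailureLimit,
                            int amReportsNeeded, List<String> log) {
        for (int attempt = 1; attempt <= maxTaskAttempts; attempt++) {
            int reports = 0;
            while (true) {
                // The input on the failed node is never fetchable in this model.
                reports++;
                log.add("attempt " + attempt + ": fetch failed, report #" + reports + " sent to AM");
                if (reports >= amReportsNeeded) {
                    log.add("AM re-runs the upstream task; shuffle can complete");
                    return Outcome.VERTEX_SUCCEEDED;
                }
                if (reports >= attemptFailureLimit) {
                    log.add("attempt " + attempt + ": blames itself and fails");
                    break;
                }
            }
        }
        log.add("task attempt limit reached; vertex and DAG fail");
        return Outcome.VERTEX_FAILED;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        // The attempt gives up (3 errors) before the AM would act (5 reports).
        Outcome outcome = simulate(4, 3, 5, log);
        log.forEach(System.out::println);
        System.out.println(outcome);
    }
}
```

In this model the DAG fails whenever a lone attempt gives up before the AM has
seen enough reports to act, which is the situation of a trailing task with no
peers left to corroborate the fetch failures.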