Gopal V created TEZ-2145:
----------------------------
Summary: Testing: Cover all failure scenarios in Pipelined data
movement
Key: TEZ-2145
URL: https://issues.apache.org/jira/browse/TEZ-2145
Project: Apache Tez
Issue Type: Sub-task
Reporter: Gopal V
Assignee: Rajesh Balamohan
This sub-task covers the failure scenarios for downstream tasks in the
pipelined case; they are required for all cases with deterministic spill
boundaries.
The primary exit criterion for this is to ensure that all success scenarios
produce the correct result.
The issue is correctness - the only performant case under consideration is the
failure-free scenario. Every scenario with an incorrect (or even suspect)
result should end in a failure, which is then handled by Tez's task failure
tolerance safety net.
The window of error introduced by pipelining exists between the first event
and the last event; all other time frames (as observed) in the downstream
vertex are irrelevant.
The only task failures that matter to a downstream task are those of tasks
which have already sent it events - this specifically does not honor any of the
empty partition bitset optimizations; the event is still counted, as the empty
flag might itself be an error in execution.
A task which fails before the first spill communication is not relevant to any
of the criteria below.
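
The relevance rule above can be sketched roughly as follows. This is a minimal
illustration in Python, not Tez code: the class name, method names and task ids
are all hypothetical, and the only behaviour it models is "a source failure
matters once that source has sent this task any event, empty or not".

```python
from dataclasses import dataclass, field


@dataclass
class DownstreamTaskState:
    """Hypothetical bookkeeping for one downstream task (not a Tez API)."""

    # Ids of source tasks from which at least one event has been received.
    seen_sources: set = field(default_factory=set)

    def on_event(self, src_task: str, spill_index: int, is_empty: bool) -> None:
        # Events flagged as empty are recorded as well: the empty flag
        # might itself be the product of a faulty execution.
        self.seen_sources.add(src_task)

    def source_failure_is_relevant(self, src_task: str) -> bool:
        # A source that failed before its first spill communication
        # never reached this task, so its failure can be ignored here.
        return src_task in self.seen_sources
```

A source that has only ever sent an "empty" event is still treated as relevant,
which matches the requirement that the bitset optimization is not honored.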
Known suspect scenarios
1) A task fails and the result becomes obsolete
2) A node gets black-listed and all data on the shuffle directory is obsoleted
3) No task which has sent out a DME is speculated
4) To check for #3, two events with the same src and same spill index are
received in the same reducer
In all these scenarios, the downstream vertex (on an ordered or unordered
edge) has to lose all state and fail.
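
As an illustration of the check in #4, a downstream task could remember every
(src, spill index) pair it has seen and fail on the first repeat. The sketch
below is Python rather than Tez's Java, and SpillTracker, SuspectInputError and
the task ids are made-up names; it assumes "src" identifies the source task
independently of which attempt produced the event.

```python
class SuspectInputError(Exception):
    """Signals that the downstream task must drop its state and fail."""


class SpillTracker:
    """Hypothetical per-reducer record of received (src, spill index) pairs."""

    def __init__(self):
        self._seen = set()

    def record(self, src_task, spill_index):
        key = (src_task, spill_index)
        if key in self._seen:
            # Same src and same spill index twice: the source was most
            # likely re-run or speculated after it had already sent data.
            self._seen.clear()  # lose all state before failing
            raise SuspectInputError(
                f"duplicate event for src={src_task} spill={spill_index}"
            )
        self._seen.add(key)

    def received_count(self):
        return len(self._seen)
```

Raising instead of de-duplicating is deliberate in this sketch: the description
asks for a failure that goes through task retries, not for in-place recovery.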
These failures are counted against the task's maximum retries, to prevent a
sequence of failures from cascading and eating up capacity on a cluster.
Some of these corner cases do have performant error-correcting measures, which
rely on a reducer checkpoint to hold onto state and continue the reducer
execution after the upstream retries have been executed.
Those scenarios are not part of this testing.