Gopal V created TEZ-2145:
----------------------------

             Summary: Testing: Cover all failure scenarios in Pipelined data 
movement
                 Key: TEZ-2145
                 URL: https://issues.apache.org/jira/browse/TEZ-2145
             Project: Apache Tez
          Issue Type: Sub-task
            Reporter: Gopal V
            Assignee: Rajesh Balamohan


This sub-task covers the failure scenarios for downstream tasks in the 
pipelined case, which must be tested for all cases with deterministic spill 
boundaries.

The primary exit criterion for this is to ensure that all success scenarios 
produce the correct result.

The issue is correctness - the only performant case in consideration is a 
failure-free scenario. All incorrect-result (or even suspect) scenarios should 
result in failures, which are handled by Tez's task failure tolerance safety 
net.

The window of error introduced by pipelining exists between the first event 
and the last event; all other time frames (as observed) in the downstream 
vertex are irrelevant.

The only task failures that matter to a downstream task are those of tasks 
which have already sent it events. This specifically does not honor any of the 
empty partition bitset optimizations: the event is still counted, because the 
empty flag might itself be the result of an error in execution.

A task which fails before the first spill communication is not relevant to any 
of the criteria below.
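
The relevance rule above can be sketched roughly as follows. This is only an 
illustration in Python, not Tez code: the class UpstreamEventLog, its methods 
and the attempt-id strings are all hypothetical names made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class UpstreamEventLog:
    """Hypothetical per-reducer record of upstream attempts that sent events."""

    seen_sources: set = field(default_factory=set)

    def on_event(self, src_attempt: str, is_empty_partition: bool = False) -> None:
        # The event is recorded even when it is flagged as empty: the flag
        # itself may be the product of a faulty execution, so it earns no
        # exemption from the failure check.
        self.seen_sources.add(src_attempt)

    def failure_is_relevant(self, failed_attempt: str) -> bool:
        # A failure only matters to this reducer if that attempt has
        # already sent it at least one event.
        return failed_attempt in self.seen_sources
```

An attempt that dies before its first spill communication never appears in 
seen_sources, so failure_is_relevant() returns False for it.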

Known suspect scenarios

1) A task fails and the result becomes obsolete 
2) A node gets black-listed and all data on the shuffle directory is obsoleted
3) No task which has sent out a DME (DataMovementEvent) is speculated
4) To check for #3, two events with the same src and same spill index are 
received in the same reducer

In all these scenarios, the downstream vertex (on an ordered or unordered 
edge) has to lose all state and fail.
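
As a rough sketch of the check in #4, a reducer-side test could track the 
(src, spill index) pairs it has seen and fail on the first repeat. The names 
SpillTracker, SuspectInputError and on_spill_event are hypothetical and written 
in Python for brevity; they are not part of Tez.

```python
class SuspectInputError(Exception):
    """Raised when the downstream task must drop its state and fail."""


class SpillTracker:
    """Hypothetical reducer-side record of spill events received so far."""

    def __init__(self):
        # Maps (source task index, spill index) -> attempt id that supplied it.
        self._seen = {}

    def on_spill_event(self, src_task, spill_index, attempt):
        key = (src_task, spill_index)
        if key in self._seen:
            # Same src and same spill index seen twice: the input is suspect,
            # so the task fails instead of trying to reconcile the two events.
            raise SuspectInputError(
                f"duplicate spill {spill_index} from src {src_task}: "
                f"attempts {self._seen[key]} and {attempt}"
            )
        self._seen[key] = attempt
```

The tracker would be discarded together with the rest of the task state when 
the error is raised, so a retried attempt starts from an empty record.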

These failures are counted against the task's maximum retries, to prevent a 
sequence of failures from cascading and eating up capacity on a cluster.

Some of these corner cases do have performant error-correcting measures, which 
rely on a reducer checkpoint to hold onto state and continue the reducer 
execution after the upstream retries have been executed.

Those scenarios are not part of this testing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
