Siddharth Seth created TEZ-3718:
-----------------------------------

             Summary: Better handling of 'bad' nodes
                 Key: TEZ-3718
                 URL: https://issues.apache.org/jira/browse/TEZ-3718
             Project: Apache Tez
          Issue Type: Improvement
            Reporter: Siddharth Seth


At the moment, the default behaviour in case of a node being marked bad is to 
do nothing other than not schedule new tasks on this node.
The alternate, via config, is to retroactively kill every task which ran on the 
node, which causes far too many unnecessary re-runs.

Proposing the following changes.
1. KILL fragments which are currently in the RUNNING state (instead of relying 
on a timeout which leads to the attempt being marked as FAILED after the 
timeout interval.
2. Keep track of these failed nodes, and use this as input to the failure 
heuristics. Normally source tasks require multiple consumers to report failure 
for them to be marked as bad. If a single consumer reports failure against a 
source which ran on a bad node, consider it bad and re-schedule immediately. 
(Otherwise failures can take a while to propagate, and jobs get a lot slower).

[~jlowe] - think you've looked at this in the past. Any thoughts/suggestions.
What I'm seeing is retroactive failures taking a long time to apply, and 
restart sources which ran on a bad node. Also running tasks being counted as 
FAILURES instead of KILLS.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to