Bikas Saha created TEZ-3075:
-------------------------------

             Summary: Revamp bad node handling
                 Key: TEZ-3075
                 URL: https://issues.apache.org/jira/browse/TEZ-3075
             Project: Apache Tez
          Issue Type: Improvement
            Reporter: Bikas Saha


The current bad node handling logic is derived from MapReduce (MR) and does not work in all cases.
Things to consider:
1) Have a notion of probation where machines are put out of service for a 
period of time (say 5m, 15m and 30m) before being given up for good. This 
allows more graceful handling of temporary glitches.
2) Different handling for YARN marking a node as bad vs internal heuristics.
3) Bad nodes should not immediately trigger re-execution of completed work. 
That should be based on the presence of downstream consumers (i.e. existing 
demand for that output) and a reasonable indication from other consumers of 
that node's output that it cannot serve results (e.g. multiple reports of 
read errors with that node as a source).
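
The three points above could be sketched roughly as follows. This is only an 
illustration in Python, not Tez code: the names NodeHealth, NodeState and 
READ_ERROR_THRESHOLD are hypothetical, the threshold of 2 is an arbitrary 
placeholder, and treating a YARN report as immediately final is an assumption 
made for the example. The probation periods come from the "say 5m, 15m and 
30m" suggestion in point 1.

```python
from dataclasses import dataclass, field
from enum import Enum

# Probation periods in seconds, from the "say 5m, 15m and 30m" suggestion.
PROBATION_PERIODS = (5 * 60, 15 * 60, 30 * 60)
# Hypothetical threshold: distinct consumers that must report read errors.
READ_ERROR_THRESHOLD = 2


class NodeState(Enum):
    HEALTHY = "healthy"
    PROBATION = "probation"
    GIVEN_UP = "given_up"


@dataclass
class NodeHealth:
    state: NodeState = NodeState.HEALTHY
    strikes: int = 0  # failures flagged by internal heuristics so far
    probation_until: float = 0.0
    read_error_reporters: set = field(default_factory=set)

    def report_internal_failure(self, now: float) -> None:
        """Internal heuristics flagged the node: escalate probation."""
        if self.state is NodeState.GIVEN_UP:
            return
        if self.strikes >= len(PROBATION_PERIODS):
            self.state = NodeState.GIVEN_UP  # all probation periods used up
            return
        self.state = NodeState.PROBATION
        self.probation_until = now + PROBATION_PERIODS[self.strikes]
        self.strikes += 1

    def report_yarn_unhealthy(self) -> None:
        """Assumption: a YARN report is treated as final, skipping probation."""
        self.state = NodeState.GIVEN_UP

    def report_read_error(self, consumer_id: str) -> None:
        """Record which consumer failed to read output from this node."""
        self.read_error_reporters.add(consumer_id)

    def is_schedulable(self, now: float) -> bool:
        if self.state is NodeState.GIVEN_UP:
            return False
        if self.state is NodeState.PROBATION and now >= self.probation_until:
            self.state = NodeState.HEALTHY  # probation served; strikes are kept
        return self.state is NodeState.HEALTHY

    def should_rerun_completed_work(self, pending_consumers: int) -> bool:
        """Re-run only if the output is still in demand and enough
        distinct consumers have reported read errors against this node."""
        return (pending_consumers > 0
                and len(self.read_error_reporters) >= READ_ERROR_THRESHOLD)
```

In this sketch a node returns to service after each probation period but keeps 
its strike count, so a fourth internal failure takes it out for good. Completed 
work is re-executed only when downstream demand and repeated read errors are 
both present.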




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
