[ https://issues.apache.org/jira/browse/TEZ-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yingda Chen reassigned TEZ-3075:
--------------------------------

    Assignee: Ying Han

> Revamp bad node handling
> ------------------------
>
>                 Key: TEZ-3075
>                 URL: https://issues.apache.org/jira/browse/TEZ-3075
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>            Assignee: Ying Han
>            Priority: Major
>
> The current logic around bad node handling is derived from MR (MapReduce)
> and does not work in all cases.
> Things to consider:
> 1) Have a notion of probation, where machines are put out of service for a
> period of time (say 5m, 15m and 30m) before being given up on for good. This
> allows more graceful handling of temporary glitches.
> 2) Handle YARN marking a node as bad differently from internal heuristics.
> 3) Bad nodes should not immediately trigger re-execution of completed work.
> That should be based on the presence of downstream consumers (i.e. existing
> demand for that output) and a reasonable indication from other consumers of
> that node's output that it cannot serve results (e.g. multiple reports of
> read errors with that node as a source).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
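
For illustration only, the probation idea in item 1 of the issue description could be sketched roughly as below. This is not code from Tez: the class name NodeProbationTracker, its methods, and the choice to ignore reports that arrive while a node is already out of service are all assumptions made for the example. The 5m/15m/30m steps are the ones suggested in the issue text. Time is passed in explicitly so the behaviour is easy to follow.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of escalating node probation; not part of the Tez codebase. */
class NodeProbationTracker {
    // Escalating probation periods taken from the issue text (5m, 15m, 30m).
    private static final Duration[] PROBATION_STEPS = {
        Duration.ofMinutes(5), Duration.ofMinutes(15), Duration.ofMinutes(30)
    };

    private static final class NodeState {
        int strikes;             // how many times the node has been flagged so far
        long probationEndMillis; // when the most recent probation period ends
    }

    private final Map<String, NodeState> nodes = new HashMap<>();

    /** Flags a node as bad at the given time. */
    void reportBad(String nodeId, long nowMillis) {
        // Assumption: a node that is already out of service is not penalised again.
        if (!isUsable(nodeId, nowMillis)) {
            return;
        }
        NodeState state = nodes.computeIfAbsent(nodeId, k -> new NodeState());
        if (state.strikes < PROBATION_STEPS.length) {
            state.probationEndMillis =
                nowMillis + PROBATION_STEPS[state.strikes].toMillis();
        }
        // Once strikes exceeds the number of steps, the node is given up for good.
        state.strikes++;
    }

    /** True once the node has used up every probation step and failed again. */
    boolean isPermanentlyBad(String nodeId) {
        NodeState state = nodes.get(nodeId);
        return state != null && state.strikes > PROBATION_STEPS.length;
    }

    /** True if the node is neither on probation nor permanently bad. */
    boolean isUsable(String nodeId, long nowMillis) {
        NodeState state = nodes.get(nodeId);
        if (state == null) {
            return true;
        }
        if (state.strikes > PROBATION_STEPS.length) {
            return false;
        }
        return nowMillis >= state.probationEndMillis;
    }
}
```

In this sketch a node returns to service automatically when its probation period ends, and only a fourth failure removes it permanently. Items 2 and 3 of the issue (separate handling for YARN-reported bad nodes, and demand-driven re-execution) are deliberately left out.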