[
https://issues.apache.org/jira/browse/TEZ-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Siddharth Seth updated TEZ-965:
-------------------------------
Target Version/s: 0.7.0
> Tez needs a "circuit-breaker" to avoid mistaking network blips for task/node
> failures
> ------------------------------------------------------------------------------------
>
> Key: TEZ-965
> URL: https://issues.apache.org/jira/browse/TEZ-965
> Project: Apache Tez
> Issue Type: Bug
> Environment: Flaky DNS cluster
> Reporter: Gopal V
>
> If DNS resolution fails for a period of 5-10 seconds, Tez restarts tasks and
> runs backwards through the query, triggering recovery of nearly everything it
> has already run. Nodes are marked as bad because they cannot shuffle (DNS
> resolution failed for all NMs), which results in log lines like:
> {code}
> attempt_1394928384313_0234_1_25_000654_0 blamed for read error from
> attempt_1394928384313_0234_1_24_000366_0
> {code}
> The tasks then restart from an earlier vertex.
> When a large number of such failures happen, the tasks shouldn't restart
> previous vertices; instead, Tez should trip a circuit breaker and back off
> until the network blip disappears.
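
The behaviour requested above could be sketched roughly as follows. This is only an illustration of the circuit-breaker idea, not Tez code: the class name FetchFailureCircuitBreaker, its methods, and the threshold, window, and back-off parameters are all hypothetical. It counts fetch failures in a sliding time window, and once the count reaches a threshold it "opens" for a back-off period, during which read errors would not be blamed on the upstream attempt.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a fetch-failure circuit breaker; not part of the Tez API. */
class FetchFailureCircuitBreaker {
    private final int failureThreshold; // failures inside the window that trip the breaker
    private final long windowMillis;    // length of the sliding window used for counting
    private final long backoffMillis;   // how long the breaker stays open once tripped
    private final Deque<Long> recentFailures = new ArrayDeque<>();
    private long openUntilMillis = Long.MIN_VALUE; // breaker starts closed

    FetchFailureCircuitBreaker(int failureThreshold, long windowMillis, long backoffMillis) {
        this.failureThreshold = failureThreshold;
        this.windowMillis = windowMillis;
        this.backoffMillis = backoffMillis;
    }

    /** Records a fetch failure seen at nowMillis; returns true if the breaker is open afterwards. */
    synchronized boolean recordFailure(long nowMillis) {
        // Drop failures that have aged out of the sliding window.
        while (!recentFailures.isEmpty()
                && nowMillis - recentFailures.peekFirst() > windowMillis) {
            recentFailures.pollFirst();
        }
        recentFailures.addLast(nowMillis);
        // Many failures close together look like a network blip: open the breaker.
        if (recentFailures.size() >= failureThreshold) {
            openUntilMillis = nowMillis + backoffMillis;
        }
        return isOpen(nowMillis);
    }

    /** True while the back-off period is still running. */
    synchronized boolean isOpen(long nowMillis) {
        return nowMillis < openUntilMillis;
    }

    /** Only blame the upstream attempt (and trigger re-execution) while the breaker is closed. */
    synchronized boolean shouldBlameSource(long nowMillis) {
        return !isOpen(nowMillis);
    }
}
```

A caller would pass a clock reading such as System.currentTimeMillis() and, while isOpen returns true, retry the fetch after a delay instead of reporting the source attempt as failed. Timestamps are passed in explicitly here only to keep the sketch easy to test.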