Siddharth Seth created TEZ-3130:
-----------------------------------
Summary: A bad NodeManager can end up occupying all container
launcher threads, delaying new launches
Key: TEZ-3130
URL: https://issues.apache.org/jira/browse/TEZ-3130
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Siddharth Seth
Fix For: 0.8.3
If a single bad NodeManager has a lot of containers allocated on it, all
container launcher threads can end up blocked on that node, delaying
subsequent launches.
This happens even though timeouts kick in.
1) We should not allow all threads to be used up for a single NM
2) The retry policy could be enhanced to stop at ConnectionTimeouts (e.g. Node
down)
3) Interrupt launch requests once Tez has detected a container as timed out.
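A minimal sketch of how point 1 might look, assuming a per-node cap enforced
with one counting semaphore per NodeManager. The class name
PerNodeLaunchLimiter, the maxPerNode parameter and the node id strings are
hypothetical and not part of Tez; how a launcher thread should react to a
refused permit (requeue, delay, fail the launch) is left open.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;

/**
 * Hypothetical helper (not a Tez class): caps how many launcher threads
 * may be busy with the same NodeManager at once.
 */
final class PerNodeLaunchLimiter {
  private final int maxPerNode;
  // One semaphore per node id, created lazily on first use.
  private final ConcurrentMap<String, Semaphore> permits =
      new ConcurrentHashMap<>();

  PerNodeLaunchLimiter(int maxPerNode) {
    if (maxPerNode < 1) {
      throw new IllegalArgumentException("maxPerNode must be >= 1");
    }
    this.maxPerNode = maxPerNode;
  }

  /**
   * Non-blocking: returns false when the node already has maxPerNode
   * launches in flight, so the caller can pick up work for another node.
   */
  boolean tryAcquire(String nodeId) {
    return permits
        .computeIfAbsent(nodeId, k -> new Semaphore(maxPerNode))
        .tryAcquire();
  }

  /** Must be called exactly once for every successful tryAcquire. */
  void release(String nodeId) {
    Semaphore s = permits.get(nodeId);
    if (s != null) {
      s.release();
    }
  }
}
```

A launcher thread would call tryAcquire(nodeId) before contacting the NM and
release(nodeId) in a finally block, so a hung NM can hold at most maxPerNode
threads.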
Noticed by [~rajesh.balamohan]: threads would lock up for 15 minutes in 0.7,
and potentially indefinitely in 0.8. The 0.8 behavior is a separate bug that
needs investigation.