Thomas Graves created MAPREDUCE-4214:
----------------------------------------

             Summary: nodemanager should cleanup running containers when it starts
                 Key: MAPREDUCE-4214
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4214
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2, nodemanager
    Affects Versions: 0.23.3
            Reporter: Thomas Graves


Currently the nodemanager doesn't clean up running containers when it gets 
restarted. This can cause containers to get lost and stick around forever. 
We've seen this happen multiple times when the RM is restarted. When the RM is 
brought back up, it doesn't know what was running on the cluster, so it 
tells the NMs to reboot, and when an NM reboots it loses track of what it had 
running. If any of those containers are behaving badly, nothing is left that 
knows about them and can kill them.

We should kill any running containers when the nodemanager is being started. 
Note that when the NM is being brought up it needs to somehow figure out which 
containers were running and be sure it doesn't kill anything it shouldn't. 
We should also try to kill any running containers when the nodemanager 
is shutting down (JIRA 4213 was filed for this).
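
As a rough illustration of the startup check described above, here is a 
minimal sketch in Python. It is not NodeManager code. It assumes, 
hypothetically, that the NM persists a small record (container id, pid, and 
process start time) for every container it launches, and that on startup it can 
obtain the pid and start time of each live process on the node. All names and 
values below are made up for the example. Matching on start time as well as pid 
is one possible way to avoid killing an unrelated process that has reused a pid.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ContainerRecord:
    """Hypothetical record the NM would persist when it launches a container."""
    container_id: str
    pid: int
    start_time: int  # process start time noted at launch (assumed unit: ticks)


def select_stale_containers(
    records: List[ContainerRecord],
    live_processes: Dict[int, int],
) -> List[ContainerRecord]:
    """Return the recorded containers that are still running.

    live_processes maps pid -> start time of the process that currently owns
    that pid. A record is selected only when both the pid and the start time
    match, so a pid reused by an unrelated process is skipped.
    """
    stale = []
    for rec in records:
        if live_processes.get(rec.pid) == rec.start_time:
            stale.append(rec)
    return stale


# Made-up state left behind by a previous NM instance.
records = [
    ContainerRecord("container_01", 4001, 1000),  # still running
    ContainerRecord("container_02", 4002, 1010),  # already exited
    ContainerRecord("container_03", 4003, 1020),  # pid reused by another process
]
# Made-up snapshot of live processes: pid -> start time.
live = {4001: 1000, 4003: 9999, 1: 0}

to_kill = select_stale_containers(records, live)
print([r.container_id for r in to_kill])
```

In this sketch only container_01 would be selected for killing. container_02 
is gone already, and container_03's pid now belongs to a different process, so 
it is left alone.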

This might change a bit when RM restart is implemented, if tasks can actually 
survive across RM/NM reboots, but that can be addressed at that point.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
