Andrey Klochkov created MAPREDUCE-5501:
------------------------------------------

             Summary: RMContainer Allocator loops forever after cluster shutdown in tests
                 Key: MAPREDUCE-5501
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5501
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: trunk
            Reporter: Andrey Klochkov


After running the MR job client tests, many MRAppMaster processes stay alive.
The reason seems to be that the RMContainerAllocator thread (named
"RMCommunicator Allocator" in the log) ignores InterruptedException and keeps
retrying:

{code}
2013-09-09 18:52:07,505 WARN [RMCommunicator Allocator] org.apache.hadoop.util.ThreadUtil: interrupted while sleeping
java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.util.ThreadUtil.sleepAtLeastIgnoreInterrupts(ThreadUtil.java:43)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:149)
        at com.sun.proxy.$Proxy29.allocate(Unknown Source)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.makeRemoteRequest(RMContainerRequestor.java:154)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:553)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:219)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:236)
        at java.lang.Thread.run(Thread.java:680)
2013-09-09 18:52:37,639 INFO [RMCommunicator Allocator] org.apache.hadoop.ipc.Client: Retrying connect to server: dhcpx-197-141.corp.yahoo.com/10.73.197.141:61163. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-09-09 18:52:38,640 INFO [RMCommunicator Allocator] org.apache.hadoop.ipc.Client: Retrying connect to server: dhcpx-197-141.corp.yahoo.com/10.73.197.141:61163. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
{code}
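
For illustration only, here is a minimal sketch of a retry loop that stops as
soon as its thread is interrupted, instead of sleeping through the interrupt
as the trace above suggests. The class and method names (InterruptAwareRetry,
retry) are hypothetical and are not part of Hadoop; this is not a proposed
patch, just one way the desired behaviour could look.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper, not a Hadoop class: retries a call but honors interrupts. */
public final class InterruptAwareRetry {

    private InterruptAwareRetry() {}

    /**
     * Runs {@code attempt} once, then up to {@code maxRetries} more times,
     * sleeping {@code sleepMillis} between attempts.
     *
     * @return true if an attempt succeeded, false if all attempts failed
     * @throws InterruptedException if the thread is interrupted before an
     *         attempt or while sleeping, so the caller can shut down promptly
     */
    public static boolean retry(BooleanSupplier attempt, int maxRetries, long sleepMillis)
            throws InterruptedException {
        for (int i = 0; i <= maxRetries; i++) {
            // Check (and clear) the interrupt flag before doing any more work.
            if (Thread.interrupted()) {
                throw new InterruptedException("interrupted before attempt " + i);
            }
            if (attempt.getAsBoolean()) {
                return true;
            }
            if (i < maxRetries) {
                // Thread.sleep propagates InterruptedException instead of swallowing it.
                Thread.sleep(sleepMillis);
            }
        }
        return false;
    }
}
```

With a loop like this, interrupting the heartbeat thread during shutdown would
end the retries immediately rather than after the full retry window.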

The processes take more than 6 minutes to die (the allocator gives up only
after 360000 milliseconds), which causes various issues in tests that use the
same DFS dir:

{code}
2013-09-09 22:26:47,179 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error communicating with RM: Could not contact RM after 360000 milliseconds.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Could not contact RM after 360000 milliseconds.
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:563)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:219)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:236)
        at java.lang.Thread.run(Thread.java:680)
{code}

Will attach a thread dump separately. 

