Danil Serdyuchenko created HDFS-13669:
-----------------------------------------

             Summary: YARN in HA not failing over to a new resource manager.
                 Key: HDFS-13669
                 URL: https://issues.apache.org/jira/browse/HDFS-13669
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 2.7.1
            Reporter: Danil Serdyuchenko


We are running YARN in HA mode with two RMs (rm1 and rm2). We hit an issue when 
recreating one of the RMs:
 # Recreated the standby RM (rm2), which gave it a new IP.
 # Stopped the active RM (rm1).
 # The NMs tried to fail over to rm2, but timed out because they were still connecting to its old IP.
 # The NMs reached the configured 30 failover retries and shut down.

We get the following logs.
{noformat}
18/06/06 15:36:32 WARN ipc.Client: Address change detected. Old: 
yarnrm2/x.x.x.x:8031 New: yarnrm2/y.y.y.y:8031
18/06/06 15:36:32 INFO retry.RetryInvocationHandler: Exception while invoking 
nodeHeartbeat of class ResourceTrackerPBClientImpl over rm2 after 25 fail over 
attempts. Trying to fail over after sleeping for 37191ms.
org.apache.hadoop.net.ConnectTimeoutException: Call From ip-a-a-a-a/a.a.a.a to 
yarnrm2:8031 failed on socket timeout exception: 
org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while 
waiting for channel to be ready for connect. ch : 
java.nio.channels.SocketChannel[connection-pending 
remote=yarnrm2/x.x.x.x:8031]; For more details see:  
http://wiki.apache.org/hadoop/SocketTimeout
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
        at org.apache.hadoop.ipc.Client.call(Client.java:1480)
        at org.apache.hadoop.ipc.Client.call(Client.java:1407)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy28.nodeHeartbeat(Unknown Source)
        at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy29.nodeHeartbeat(Unknown Source)
        at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:596)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout 
while waiting for channel to be ready for connect. ch : 
java.nio.channels.SocketChannel[connection-pending 
remote=yarnrm2.grappler.eu-west-1.prod.aws.skyscanner.local/10.51.104.136:8031]
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
        at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:609)
        at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
        at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
        at org.apache.hadoop.ipc.Client.call(Client.java:1446)
        ... 12 more{noformat}
We see this error, followed by a failover back to rm1, repeatedly until the limit of 30 failovers is reached:
{noformat}
18/06/06 15:39:44 WARN retry.RetryInvocationHandler: Exception while invoking 
class 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat
 over rm1. Not retrying because failovers (30) exceeded maximum allowed 
(30){noformat}
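The failover accounting seen in the two log excerpts above can be modelled roughly as follows. This is only a sketch under stated assumptions, not the RetryInvocationHandler code: the class FailoverBudgetSketch, the method heartbeat, and the round-robin order starting at rm1 are all hypothetical and chosen for illustration.

```java
/** Simplified model of a failover budget; NOT the Hadoop RetryInvocationHandler. */
public class FailoverBudgetSketch {

    /**
     * Models one heartbeat call that fails over between RMs in round-robin
     * order, starting at rm1 (assumption), until an RM is reachable or the
     * failover budget is used up.
     *
     * @param reachable    reachable[i] says whether RM number i+1 accepts connections
     * @param maxFailovers models the configured failover limit (30 in the report)
     */
    public static String heartbeat(boolean[] reachable, int maxFailovers) {
        int current = 0;    // index of the RM currently being tried
        int failovers = 0;  // failovers performed so far for this call
        while (true) {
            String rm = "rm" + (current + 1);
            if (reachable[current]) {
                return "connected to " + rm + " after " + failovers + " failovers";
            }
            if (failovers >= maxFailovers) {
                return "gave up on " + rm + " after " + failovers + " failovers";
            }
            current = (current + 1) % reachable.length;  // fail over to the next RM
            failovers++;
        }
    }

    public static void main(String[] args) {
        // rm1 is stopped and rm2 is only tried on its stale IP: neither is reachable.
        System.out.println(heartbeat(new boolean[] {false, false}, 30));
        // If rm2 were reached on its new IP, a single failover would be enough.
        System.out.println(heartbeat(new boolean[] {false, true}, 30));
    }
}
```

In this model the call gives up while pointing at rm1, because an even number of failovers between two RMs ends on the starting RM. That is consistent with the "over rm1" in the final log line, but it rests on the assumed starting point.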
From the logs, it appears that the timeouts happen because the NM is still trying to 
connect to the old IP (x.x.x.x in the logs). Looking at the code of the Client 
class, after the updateAddress method call we should expect a retry with 
the new server IP (the "Retrying connect to server ..." log), up to 
ipc.client.connect.max.retries.on.timeouts times. However, we never see the 
retry logs, and the call just fails with the exception above. This setting is left at its 
default of 45 on all of our NMs.
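The behaviour we expected can be sketched as below. This is a simplified model, not the actual org.apache.hadoop.ipc.Client code: the class ConnectRetrySketch, the method connectWithRetries, and the resolver and reachable parameters are hypothetical stand-ins for the DNS lookup and the socket connect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Simplified model of a connect loop; NOT the Hadoop ipc.Client code. */
public class ConnectRetrySketch {

    /**
     * Tries to connect, re-resolving the host after every timed-out attempt.
     * Returns the IPs that were attempted, in order.
     *
     * @param resolver   hypothetical stand-in for a DNS lookup of the RM hostname
     * @param reachable  hypothetical stand-in for "connect succeeded before the timeout"
     * @param maxRetries models ipc.client.connect.max.retries.on.timeouts
     */
    public static List<String> connectWithRetries(Supplier<String> resolver,
                                                  Predicate<String> reachable,
                                                  int maxRetries) {
        List<String> attempted = new ArrayList<>();
        String cachedIp = resolver.get();  // IP resolved once, before the first attempt
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            attempted.add(cachedIp);
            if (reachable.test(cachedIp)) {
                return attempted;          // connected
            }
            // The attempt timed out: check whether the hostname now maps to another IP.
            String currentIp = resolver.get();
            if (!currentIp.equals(cachedIp)) {
                cachedIp = currentIp;      // corresponds to "Address change detected"
            }
        }
        throw new IllegalStateException("gave up after " + attempted.size() + " attempts");
    }

    public static void main(String[] args) {
        // rm2 was recreated: the first lookup returns the old IP, later lookups the new one.
        String[] answers = {"x.x.x.x", "y.y.y.y"};
        int[] calls = {0};
        Supplier<String> resolver =
                () -> answers[Math.min(calls[0]++, answers.length - 1)];
        List<String> attempted =
                connectWithRetries(resolver, ip -> ip.equals("y.y.y.y"), 45);
        System.out.println(attempted);     // prints [x.x.x.x, y.y.y.y]
    }
}
```

In this model the second attempt already goes to the new IP. In our logs, every attempt reports remote=yarnrm2/x.x.x.x:8031, which is why we suspect the updated address is not used for the retry.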


