Bharat Viswanadham created HDDS-5058:
----------------------------------------

             Summary: Make getScmInfo retry for a duration
                 Key: HDDS-5058
                 URL: https://issues.apache.org/jira/browse/HDDS-5058
             Project: Apache Ozone
          Issue Type: Bug
            Reporter: Bharat Viswanadham
            Assignee: Bharat Viswanadham


Previously, during OM init, getScmInfo was called with a 
RetryForEverWithFixedSleep policy, but this was removed as part of SCM HA.

This Jira proposes retrying getScmInfo for a certain duration, instead of 
retrying forever with a fixed sleep.

We have seen this issue in a few docker tests: OM init failed after 15 
retries because SCM was started later.


{code:java}
om1_1       | 2021-03-31 17:03:48,184 [main] WARN server.ServerUtils: 
ozone.om.db.dirs is not configured. We recommend adding this setting. Falling 
back to ozone.metadata.dirs instead.
om1_1       | 2021-03-31 17:03:52,453 [main] INFO retry.RetryInvocationHandler: 
com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
om1/172.20.0.4 to scm2:9863 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send 
over nodeId=scm2,nodeAddress=scm2/172.20.0.6:9863 after 1 failover attempts. 
Trying to failover after sleeping for 2000ms.
om1_1       | 2021-03-31 17:03:54,455 [main] INFO retry.RetryInvocationHandler: 
com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
om1/172.20.0.4 to scm3:9863 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send 
over nodeId=scm3,nodeAddress=scm3/172.20.0.7:9863 after 2 failover attempts. 
Trying to failover after sleeping for 2000ms.
om1_1       | 2021-03-31 17:03:56,457 [main] INFO retry.RetryInvocationHandler: 
com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
om1/172.20.0.4 to scm1:9863 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send 
over nodeId=scm1,nodeAddress=scm1/172.20.0.8:9863 after 3 failover attempts. 
Trying to failover after sleeping for 2000ms.
om1_1       | 2021-03-31 17:03:58,466 [main] INFO retry.RetryInvocationHandler: 
com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
om1/172.20.0.4 to scm2:9863 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send 
over nodeId=scm2,nodeAddress=scm2/172.20.0.6:9863 after 4 failover attempts. 
Trying to failover after sleeping for 2000ms.
om1_1       | 2021-03-31 17:04:00,498 [main] INFO retry.RetryInvocationHandler: 
com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
om1/172.20.0.4 to scm3:9863 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send 
over nodeId=scm3,nodeAddress=scm3/172.20.0.7:9863 after 5 failover attempts. 
Trying to failover after sleeping for 2000ms.
om1_1       | 2021-03-31 17:04:02,522 [main] INFO retry.RetryInvocationHandler: 
com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
om1/172.20.0.4 to scm1:9863 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send 
over nodeId=scm1,nodeAddress=scm1/172.20.0.8:9863 after 6 failover attempts. 
Trying to failover after sleeping for 2000ms.
om1_1       | 2021-03-31 17:04:04,533 [main] INFO retry.RetryInvocationHandler: 
com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
om1/172.20.0.4 to scm2:9863 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send 
over nodeId=scm2,nodeAddress=scm2/172.20.0.6:9863 after 7 failover attempts. 
Trying to failover after sleeping for 2000ms.
om1_1       | 2021-03-31 17:04:06,535 [main] INFO retry.RetryInvocationHandler: 
com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
om1/172.20.0.4 to scm3:9863 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send 
over nodeId=scm3,nodeAddress=scm3/172.20.0.7:9863 after 8 failover attempts. 
Trying to failover after sleeping for 2000ms.
om1_1       | 2021-03-31 17:04:08,537 [main] INFO retry.RetryInvocationHandler: 
com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
om1/172.20.0.4 to scm1:9863 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send 
over nodeId=scm1,nodeAddress=scm1/172.20.0.8:9863 after 9 failover attempts. 
Trying to failover after sleeping for 2000ms.
om1_1       | 2021-03-31 17:04:10,541 [main] INFO retry.RetryInvocationHandler: 
com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
om1/172.20.0.4 to scm2:9863 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send 
over nodeId=scm2,nodeAddress=scm2/172.20.0.6:9863 after 10 failover attempts. 
Trying to failover after sleeping for 2000ms.
om1_1       | 2021-03-31 17:04:12,543 [main] INFO retry.RetryInvocationHandler: 
com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
om1/172.20.0.4 to scm3:9863 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send 
over nodeId=scm3,nodeAddress=scm3/172.20.0.7:9863 after 11 failover attempts. 
Trying to failover after sleeping for 2000ms.
om1_1       | 2021-03-31 17:04:14,546 [main] INFO retry.RetryInvocationHandler: 
com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
om1/172.20.0.4 to scm1:9863 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send 
over nodeId=scm1,nodeAddress=scm1/172.20.0.8:9863 after 12 failover attempts. 
Trying to failover after sleeping for 2000ms.
om1_1       | 2021-03-31 17:04:16,550 [main] INFO retry.RetryInvocationHandler: 
com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
om1/172.20.0.4 to scm2:9863 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send 
over nodeId=scm2,nodeAddress=scm2/172.20.0.6:9863 after 13 failover attempts. 
Trying to failover after sleeping for 2000ms.
om1_1       | 2021-03-31 17:04:18,553 [main] INFO retry.RetryInvocationHandler: 
com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
om1/172.20.0.4 to scm3:9863 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send 
over nodeId=scm3,nodeAddress=scm3/172.20.0.7:9863 after 14 failover attempts. 
Trying to failover after sleeping for 2000ms.
om1_1       | 2021-03-31 17:04:20,795 [main] ERROR om.OzoneManager: Could not 
initialize OM version file
om1_1       | 
org.apache.hadoop.ipc.RemoteException(org.apache.ratis.protocol.exceptions.NotLeaderException):
 Server 9cb7a7ae-4c40-401c-b1c6-55728c1f0907@group-C35E1BD0DE21 is not the 
leader
om1_1       |   at 
org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.triggerNotLeaderException(SCMRatisServerImpl.java:245)
om1_1       |   at 
org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:108)
om1_1       |   at 
org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:13874)
om1_1       |   at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
om1_1       |   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
om1_1       |   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029)
om1_1       |   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:957)
om1_1       |   at java.base/java.security.AccessController.doPrivileged(Native 
Method)
om1_1       |   at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
om1_1       |   at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
om1_1       |   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2957)
om1_1       | 
{code}
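
For illustration only, here is a minimal sketch of what "retry for a 
duration with a fixed sleep" could look like. It uses only the JDK and is 
not the actual patch: DurationRetry, retryFor, and the lambda standing in 
for getScmInfo are hypothetical names. Hadoop's RetryPolicies class also has 
retryUpToMaximumTimeWithFixedSleep(maxTime, sleepTime, timeUnit), which may 
be a closer fit for the real change.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper, not part of Ozone: fixed-sleep retry with a deadline. */
final class DurationRetry {

  private DurationRetry() { }

  /**
   * Invokes the call until it succeeds or maxDuration has elapsed.
   * Once the time budget is used up, the last failure is rethrown.
   */
  static <T> T retryFor(Callable<T> call, Duration maxDuration, Duration sleep)
      throws Exception {
    final long deadline = System.nanoTime() + maxDuration.toNanos();
    while (true) {
      try {
        return call.call();
      } catch (InterruptedException e) {
        // Do not retry on interruption; restore the flag and give up.
        Thread.currentThread().interrupt();
        throw e;
      } catch (Exception e) {
        long remainingNanos = deadline - System.nanoTime();
        if (remainingNanos <= 0) {
          throw e;
        }
        // Sleep for the fixed interval, but never past the deadline.
        long remainingMillis = Math.max(1L, remainingNanos / 1_000_000L);
        Thread.sleep(Math.min(sleep.toMillis(), remainingMillis));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] attempts = {0};
    // Stand-in for getScmInfo: fails until the simulated SCM is up.
    String info = retryFor(() -> {
      if (++attempts[0] < 4) {
        throw new java.net.ConnectException("Connection refused");
      }
      return "scm-info";
    }, Duration.ofSeconds(5), Duration.ofMillis(50));
    System.out.println(info + " after " + attempts[0] + " attempts");
  }
}
```

With a time budget rather than a fixed attempt count, OM init would keep 
trying while SCM comes up late, but would still fail eventually if SCM 
never starts. The budget and sleep interval would presumably be 
configurable.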








--
This message was sent by Atlassian Jira
(v8.3.4#803005)
