Bharat Viswanadham created HDDS-5058:
----------------------------------------
Summary: Make getScmInfo retry for a duration
Key: HDDS-5058
URL: https://issues.apache.org/jira/browse/HDDS-5058
Project: Apache Ozone
Issue Type: Bug
Reporter: Bharat Viswanadham
Assignee: Bharat Viswanadham
Previously, during OM init, getScmInfo was retried with
RetryForEverWithFixedSleep, but this was removed as part of SCM HA.
This Jira proposes retrying getScmInfo for a certain duration instead of
retrying forever with a fixed sleep.
We have seen this issue in a few docker tests: OM init failed after 15
retries because SCM started later.
{code:java}
om1_1 | 2021-03-31 17:03:48,184 [main] WARN server.ServerUtils:
ozone.om.db.dirs is not configured. We recommend adding this setting. Falling
back to ozone.metadata.dirs instead.
om1_1 | 2021-03-31 17:03:52,453 [main] INFO retry.RetryInvocationHandler:
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
om1/172.20.0.4 to scm2:9863 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send
over nodeId=scm2,nodeAddress=scm2/172.20.0.6:9863 after 1 failover attempts.
Trying to failover after sleeping for 2000ms.
om1_1 | 2021-03-31 17:03:54,455 [main] INFO retry.RetryInvocationHandler:
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
om1/172.20.0.4 to scm3:9863 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send
over nodeId=scm3,nodeAddress=scm3/172.20.0.7:9863 after 2 failover attempts.
Trying to failover after sleeping for 2000ms.
om1_1 | 2021-03-31 17:03:56,457 [main] INFO retry.RetryInvocationHandler:
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
om1/172.20.0.4 to scm1:9863 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send
over nodeId=scm1,nodeAddress=scm1/172.20.0.8:9863 after 3 failover attempts.
Trying to failover after sleeping for 2000ms.
om1_1 | 2021-03-31 17:03:58,466 [main] INFO retry.RetryInvocationHandler:
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
om1/172.20.0.4 to scm2:9863 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send
over nodeId=scm2,nodeAddress=scm2/172.20.0.6:9863 after 4 failover attempts.
Trying to failover after sleeping for 2000ms.
om1_1 | 2021-03-31 17:04:00,498 [main] INFO retry.RetryInvocationHandler:
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
om1/172.20.0.4 to scm3:9863 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send
over nodeId=scm3,nodeAddress=scm3/172.20.0.7:9863 after 5 failover attempts.
Trying to failover after sleeping for 2000ms.
om1_1 | 2021-03-31 17:04:02,522 [main] INFO retry.RetryInvocationHandler:
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
om1/172.20.0.4 to scm1:9863 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send
over nodeId=scm1,nodeAddress=scm1/172.20.0.8:9863 after 6 failover attempts.
Trying to failover after sleeping for 2000ms.
om1_1 | 2021-03-31 17:04:04,533 [main] INFO retry.RetryInvocationHandler:
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
om1/172.20.0.4 to scm2:9863 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send
over nodeId=scm2,nodeAddress=scm2/172.20.0.6:9863 after 7 failover attempts.
Trying to failover after sleeping for 2000ms.
om1_1 | 2021-03-31 17:04:06,535 [main] INFO retry.RetryInvocationHandler:
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
om1/172.20.0.4 to scm3:9863 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send
over nodeId=scm3,nodeAddress=scm3/172.20.0.7:9863 after 8 failover attempts.
Trying to failover after sleeping for 2000ms.
om1_1 | 2021-03-31 17:04:08,537 [main] INFO retry.RetryInvocationHandler:
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
om1/172.20.0.4 to scm1:9863 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send
over nodeId=scm1,nodeAddress=scm1/172.20.0.8:9863 after 9 failover attempts.
Trying to failover after sleeping for 2000ms.
om1_1 | 2021-03-31 17:04:10,541 [main] INFO retry.RetryInvocationHandler:
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
om1/172.20.0.4 to scm2:9863 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send
over nodeId=scm2,nodeAddress=scm2/172.20.0.6:9863 after 10 failover attempts.
Trying to failover after sleeping for 2000ms.
om1_1 | 2021-03-31 17:04:12,543 [main] INFO retry.RetryInvocationHandler:
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
om1/172.20.0.4 to scm3:9863 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send
over nodeId=scm3,nodeAddress=scm3/172.20.0.7:9863 after 11 failover attempts.
Trying to failover after sleeping for 2000ms.
om1_1 | 2021-03-31 17:04:14,546 [main] INFO retry.RetryInvocationHandler:
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
om1/172.20.0.4 to scm1:9863 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send
over nodeId=scm1,nodeAddress=scm1/172.20.0.8:9863 after 12 failover attempts.
Trying to failover after sleeping for 2000ms.
om1_1 | 2021-03-31 17:04:16,550 [main] INFO retry.RetryInvocationHandler:
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
om1/172.20.0.4 to scm2:9863 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send
over nodeId=scm2,nodeAddress=scm2/172.20.0.6:9863 after 13 failover attempts.
Trying to failover after sleeping for 2000ms.
om1_1 | 2021-03-31 17:04:18,553 [main] INFO retry.RetryInvocationHandler:
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
om1/172.20.0.4 to scm3:9863 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send
over nodeId=scm3,nodeAddress=scm3/172.20.0.7:9863 after 14 failover attempts.
Trying to failover after sleeping for 2000ms.
om1_1 | 2021-03-31 17:04:20,795 [main] ERROR om.OzoneManager: Could not
initialize OM version file
om1_1 |
org.apache.hadoop.ipc.RemoteException(org.apache.ratis.protocol.exceptions.NotLeaderException):
Server 9cb7a7ae-4c40-401c-b1c6-55728c1f0907@group-C35E1BD0DE21 is not the
leader
om1_1 | at
org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.triggerNotLeaderException(SCMRatisServerImpl.java:245)
om1_1 | at
org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:108)
om1_1 | at
org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:13874)
om1_1 | at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
om1_1 | at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
om1_1 | at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029)
om1_1 | at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:957)
om1_1 | at java.base/java.security.AccessController.doPrivileged(Native
Method)
om1_1 | at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
om1_1 | at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
om1_1 | at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2957)
om1_1 |
{code}
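The proposal above (retry getScmInfo for a bounded duration with a fixed sleep, rather than forever or for a fixed number of attempts) could be sketched roughly as follows. This is a plain-JDK illustration only, not the actual patch: the class name BoundedRetry, the method retryForDuration, and the durations used in main are hypothetical, and the lambda merely stands in for the real getScmInfo call. Hadoop's RetryPolicies class also has a time-bounded fixed-sleep policy; whether the fix uses it is not stated in this issue.

```java
import java.net.ConnectException;
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Ozone: retries an operation with a fixed
// sleep between attempts until it succeeds or a maximum duration has elapsed.
final class BoundedRetry {

  static <T> T retryForDuration(Callable<T> op, Duration maxWait, Duration sleep)
      throws Exception {
    long deadline = System.nanoTime() + maxWait.toNanos();
    while (true) {
      try {
        return op.call();
      } catch (InterruptedException ie) {
        // Do not retry if the calling thread was interrupted.
        throw ie;
      } catch (Exception e) {
        long remaining = deadline - System.nanoTime();
        if (remaining < sleep.toNanos()) {
          // Not enough time left for another sleep: give up with the last error.
          throw e;
        }
        Thread.sleep(sleep.toMillis());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Stand-in for getScmInfo: fails twice with "Connection refused",
    // then succeeds, as if SCM had come up a little later than OM.
    String info = retryForDuration(() -> {
      if (++calls[0] < 3) {
        throw new ConnectException("Connection refused");
      }
      return "scm-info";
    }, Duration.ofSeconds(2), Duration.ofMillis(50));
    System.out.println(info + " after " + calls[0] + " attempts");
  }
}
```

With a duration-based bound, the time OM waits for SCM no longer depends on the failover attempt count, so it can be sized to how long SCM is expected to take to start.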