Soumitra Sulav created HDDS-6076:
------------------------------------

             Summary: OM api no config to limit retries on timeouts
                 Key: HDDS-6076
                 URL: https://issues.apache.org/jira/browse/HDDS-6076
             Project: Apache Ozone
          Issue Type: Bug
          Components: OM
    Affects Versions: 1.2.0
            Reporter: Soumitra Sulav


There is no configuration to limit the number of retries, or to set the retry 
interval, for OM API calls.

As a result, the client keeps retrying indefinitely if OM is down or there is 
no leader.

The retries below are observed for all APIs, since every request goes to OM first:


{code:java}
# /opt/cloudera/parcels/CDH/bin/ozone admin om getserviceroles -id=ozone1
com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException): OM:om1 is not the leader. Could not determine the leader node.
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:211)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createLeaderErrorException(OzoneManagerProtocolServerSideTranslatorPB.java:198)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:191)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:150)
    at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:124)
    at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
, while invoking $Proxy17.submitRequest over nodeId=om1,nodeAddress=quasar-fzaxrj-3.quasar-fzaxrj.root.hwx.site:9862 after 3 failover attempts. Trying to failover immediately.
com.google.protobuf.ServiceException: java.net.ConnectException: Call From st-ozone-ey75a7-5gnd6/10.107.11.200 to quasar-fzaxrj-5.quasar-fzaxrj.root.hwx.site:9862 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy17.submitRequest over nodeId=om2,nodeAddress=quasar-fzaxrj-5.quasar-fzaxrj.root.hwx.site:9862 after 4 failover attempts. Trying to failover immediately.
com.google.protobuf.ServiceException: java.net.ConnectException: Call From st-ozone-ey75a7-5gnd6/10.107.11.200 to quasar-fzaxrj-8.quasar-fzaxrj.root.hwx.site:9862 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy17.submitRequest over nodeId=om3,nodeAddress=quasar-fzaxrj-8.quasar-fzaxrj.root.hwx.site:9862 after 5 failover attempts. Trying to failover after sleeping for 2000ms.
com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException): OM:om1 is not the leader. Could not determine the leader node.
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:211)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createLeaderErrorException(OzoneManagerProtocolServerSideTranslatorPB.java:198)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:191)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:150)
    at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:124)
    at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
, while invoking $Proxy17.submitRequest over nodeId=om1,nodeAddress=quasar-fzaxrj-3.quasar-fzaxrj.root.hwx.site:9862 after 6 failover attempts. Trying to failover immediately.
com.google.protobuf.ServiceException: java.net.ConnectException: Call From st-ozone-ey75a7-5gnd6/10.107.11.200 to quasar-fzaxrj-5.quasar-fzaxrj.root.hwx.site:9862 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy17.submitRequest over nodeId=om2,nodeAddress=quasar-fzaxrj-5.quasar-fzaxrj.root.hwx.site:9862 after 7 failover attempts. Trying to failover immediately.
com.google.protobuf.ServiceException: java.net.ConnectException: Call From st-ozone-ey75a7-5gnd6/10.107.11.200 to quasar-fzaxrj-8.quasar-fzaxrj.root.hwx.site:9862 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy17.submitRequest over nodeId=om3,nodeAddress=quasar-fzaxrj-8.quasar-fzaxrj.root.hwx.site:9862 after 8 failover attempts. Trying to failover after sleeping for 2000ms.
{code}

The only related config found on the client side is {{ozone.om.client.rpc.timeout}}:

https://github.com/apache/ozone/blob/master/hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/conf/OMClientConfig.java#L42-L53
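
To illustrate the behaviour being asked for, here is a minimal sketch of a failover loop with an upper bound on attempts. It is not Ozone code: the class {{BoundedFailover}}, the interface {{OmCall}}, and the two settings {{maxFailoverAttempts}} and {{waitBetweenRoundsMillis}} are hypothetical names chosen for this example. The loop mirrors the pattern visible in the log above (fail over immediately to the next OM, sleep once every node has been tried), with the difference that it gives up after a configured number of attempts.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical sketch of a bounded failover loop; not part of Ozone. */
final class BoundedFailover {

    /** A call against one OM node; hypothetical stand-in for the real RPC proxy. */
    interface OmCall<T> {
        T invoke(String nodeId) throws Exception;
    }

    private final int maxFailoverAttempts;      // hypothetical setting
    private final long waitBetweenRoundsMillis; // hypothetical setting

    BoundedFailover(int maxFailoverAttempts, long waitBetweenRoundsMillis) {
        if (maxFailoverAttempts < 1 || waitBetweenRoundsMillis < 0) {
            throw new IllegalArgumentException("invalid retry settings");
        }
        this.maxFailoverAttempts = maxFailoverAttempts;
        this.waitBetweenRoundsMillis = waitBetweenRoundsMillis;
    }

    /** Tries the nodes round-robin and fails once the attempt budget is spent. */
    <T> T call(List<String> nodeIds, OmCall<T> rpc)
            throws IOException, InterruptedException {
        if (nodeIds.isEmpty()) {
            throw new IllegalArgumentException("no OM nodes given");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxFailoverAttempts; attempt++) {
            String nodeId = nodeIds.get(attempt % nodeIds.size());
            try {
                return rpc.invoke(nodeId);
            } catch (InterruptedException e) {
                throw e; // do not treat interruption as a retriable failure
            } catch (Exception e) {
                last = e; // remember the cause and fail over to the next node
            }
            boolean roundFinished = (attempt + 1) % nodeIds.size() == 0;
            boolean attemptsLeft = attempt + 1 < maxFailoverAttempts;
            if (roundFinished && attemptsLeft) {
                Thread.sleep(waitBetweenRoundsMillis); // back off after a full round
            }
        }
        throw new IOException(
            "Giving up after " + maxFailoverAttempts + " failover attempts", last);
    }
}
```

With three OM nodes and {{maxFailoverAttempts}} set to 5, a client using this loop would try om1, om2, om3, sleep, try om1 and om2 again, and then fail with an {{IOException}} carrying the last error as its cause.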





