[ 
https://issues.apache.org/jira/browse/HDDS-6076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Soumitra Sulav resolved HDDS-6076.
----------------------------------
    Resolution: Information Provided

> OM api no config to limit retries on timeouts
> ---------------------------------------------
>
>                 Key: HDDS-6076
>                 URL: https://issues.apache.org/jira/browse/HDDS-6076
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM
>    Affects Versions: 1.2.0
>            Reporter: Soumitra Sulav
>            Priority: Major
>
> The OM client API exposes no config to limit the number of retries or the
> interval between them. As a result, the client keeps retrying forever if OM
> is down or there is no leader.
> The retries below are observed for all APIs, since every request goes to OM
> first:
> {code:java}
> # /opt/cloudera/parcels/CDH/bin/ozone admin om getserviceroles -id=ozone1
> com.google.protobuf.ServiceException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException):
>  OM:om1 is not the leader. Could not determine the leader node.
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:211)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createLeaderErrorException(OzoneManagerProtocolServerSideTranslatorPB.java:198)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:191)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:150)
>     at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:124)
>     at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>     at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
> , while invoking $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=quasar-fzaxrj-3.quasar-fzaxrj.root.hwx.site:9862 after 
> 3 failover attempts. Trying to failover immediately.
> com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
> st-ozone-ey75a7-5gnd6/10.107.11.200 to 
> quasar-fzaxrj-5.quasar-fzaxrj.root.hwx.site:9862 failed on connection 
> exception: java.net.ConnectException: Connection refused; For more details 
> see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> $Proxy17.submitRequest over 
> nodeId=om2,nodeAddress=quasar-fzaxrj-5.quasar-fzaxrj.root.hwx.site:9862 after 
> 4 failover attempts. Trying to failover immediately.
> com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
> st-ozone-ey75a7-5gnd6/10.107.11.200 to 
> quasar-fzaxrj-8.quasar-fzaxrj.root.hwx.site:9862 failed on connection 
> exception: java.net.ConnectException: Connection refused; For more details 
> see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> $Proxy17.submitRequest over 
> nodeId=om3,nodeAddress=quasar-fzaxrj-8.quasar-fzaxrj.root.hwx.site:9862 after 
> 5 failover attempts. Trying to failover after sleeping for 2000ms.
> com.google.protobuf.ServiceException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException):
>  OM:om1 is not the leader. Could not determine the leader node.
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:211)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createLeaderErrorException(OzoneManagerProtocolServerSideTranslatorPB.java:198)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:191)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:150)
>     at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:124)
>     at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>     at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
> , while invoking $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=quasar-fzaxrj-3.quasar-fzaxrj.root.hwx.site:9862 after 
> 6 failover attempts. Trying to failover immediately.
> com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
> st-ozone-ey75a7-5gnd6/10.107.11.200 to 
> quasar-fzaxrj-5.quasar-fzaxrj.root.hwx.site:9862 failed on connection 
> exception: java.net.ConnectException: Connection refused; For more details 
> see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> $Proxy17.submitRequest over 
> nodeId=om2,nodeAddress=quasar-fzaxrj-5.quasar-fzaxrj.root.hwx.site:9862 after 
> 7 failover attempts. Trying to failover immediately.
> com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
> st-ozone-ey75a7-5gnd6/10.107.11.200 to 
> quasar-fzaxrj-8.quasar-fzaxrj.root.hwx.site:9862 failed on connection 
> exception: java.net.ConnectException: Connection refused; For more details 
> see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> $Proxy17.submitRequest over 
> nodeId=om3,nodeAddress=quasar-fzaxrj-8.quasar-fzaxrj.root.hwx.site:9862 after 
> 8 failover attempts. Trying to failover after sleeping for 2000ms. 
> {code}
> The only client-side config found is {{ozone.om.client.rpc.timeout}}:
> https://github.com/apache/ozone/blob/master/hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/conf/OMClientConfig.java#L42-L53
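
For illustration, the behaviour the report asks for (a cap on failover attempts plus a configurable wait between rounds) could look roughly like the sketch below. It mirrors the pattern visible in the log: fail over to the next OM immediately, and sleep only after every node has been tried once. This is a standalone sketch, not Ozone code: the class BoundedFailover, the method invoke, and the parameters maxAttempts and sleepMillis are hypothetical names chosen for the example and do not correspond to any Ozone API or config key.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch: round-robin failover with a hard cap on attempts. */
public class BoundedFailover {

    /** Raised once the attempt budget is spent (or the wait is interrupted). */
    public static class RetriesExhaustedException extends RuntimeException {
        public final int attempts;

        public RetriesExhaustedException(int attempts, Throwable lastFailure) {
            super("gave up after " + attempts + " failover attempts", lastFailure);
            this.attempts = attempts;
        }
    }

    /**
     * Calls each node in turn until one succeeds or maxAttempts is reached.
     * Fails over immediately within a round and sleeps sleepMillis only after
     * a full pass over all nodes, as in the log above.
     */
    public static <T> T invoke(List<String> nodes, int maxAttempts,
                               long sleepMillis, Function<String, T> call) {
        if (nodes.isEmpty() || maxAttempts < 1) {
            throw new IllegalArgumentException("need at least one node and one attempt");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String node = nodes.get((attempt - 1) % nodes.size());
            try {
                return call.apply(node);
            } catch (RuntimeException e) {
                lastFailure = e; // remember the failure, then try the next node
            }
            boolean roundFinished = attempt % nodes.size() == 0;
            if (roundFinished && attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RetriesExhaustedException(attempt, ie);
                }
            }
        }
        throw new RetriesExhaustedException(maxAttempts, lastFailure);
    }

    public static void main(String[] args) {
        List<String> oms = List.of("om1", "om2", "om3");
        try {
            // Every node fails, so the call stops after six attempts (two rounds).
            BoundedFailover.<String>invoke(oms, 6, 100L, node -> {
                throw new IllegalStateException(node + " is unreachable");
            });
        } catch (RetriesExhaustedException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a bound like this, a command such as the one in the report would return an error after a predictable amount of time instead of looping while no OM leader is available.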



