Soumitra Sulav created HDDS-6076:
------------------------------------
Summary: OM api no config to limit retries on timeouts
Key: HDDS-6076
URL: https://issues.apache.org/jira/browse/HDDS-6076
Project: Apache Ozone
Issue Type: Bug
Components: OM
Affects Versions: 1.2.0
Reporter: Soumitra Sulav
There is no config to limit the number of retries, or the interval between them, for OM API calls.
As a result, the client keeps retrying forever if OM is down or there is no
leader.
The retries below are observed for all APIs, since every request goes to OM first:
{code:java}
# /opt/cloudera/parcels/CDH/bin/ozone admin om getserviceroles -id=ozone1
com.google.protobuf.ServiceException:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException):
OM:om1 is not the leader. Could not determine the leader node.
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:211)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createLeaderErrorException(OzoneManagerProtocolServerSideTranslatorPB.java:198)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:191)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:150)
at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:124)
at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
, while invoking $Proxy17.submitRequest over
nodeId=om1,nodeAddress=quasar-fzaxrj-3.quasar-fzaxrj.root.hwx.site:9862 after 3
failover attempts. Trying to failover immediately.
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
st-ozone-ey75a7-5gnd6/10.107.11.200 to
quasar-fzaxrj-5.quasar-fzaxrj.root.hwx.site:9862 failed on connection
exception: java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking
$Proxy17.submitRequest over
nodeId=om2,nodeAddress=quasar-fzaxrj-5.quasar-fzaxrj.root.hwx.site:9862 after 4
failover attempts. Trying to failover immediately.
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
st-ozone-ey75a7-5gnd6/10.107.11.200 to
quasar-fzaxrj-8.quasar-fzaxrj.root.hwx.site:9862 failed on connection
exception: java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking
$Proxy17.submitRequest over
nodeId=om3,nodeAddress=quasar-fzaxrj-8.quasar-fzaxrj.root.hwx.site:9862 after 5
failover attempts. Trying to failover after sleeping for 2000ms.
com.google.protobuf.ServiceException:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException):
OM:om1 is not the leader. Could not determine the leader node.
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:211)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createLeaderErrorException(OzoneManagerProtocolServerSideTranslatorPB.java:198)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:191)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:150)
at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:124)
at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
, while invoking $Proxy17.submitRequest over
nodeId=om1,nodeAddress=quasar-fzaxrj-3.quasar-fzaxrj.root.hwx.site:9862 after 6
failover attempts. Trying to failover immediately.
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
st-ozone-ey75a7-5gnd6/10.107.11.200 to
quasar-fzaxrj-5.quasar-fzaxrj.root.hwx.site:9862 failed on connection
exception: java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking
$Proxy17.submitRequest over
nodeId=om2,nodeAddress=quasar-fzaxrj-5.quasar-fzaxrj.root.hwx.site:9862 after 7
failover attempts. Trying to failover immediately.
com.google.protobuf.ServiceException: java.net.ConnectException: Call From
st-ozone-ey75a7-5gnd6/10.107.11.200 to
quasar-fzaxrj-8.quasar-fzaxrj.root.hwx.site:9862 failed on connection
exception: java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking
$Proxy17.submitRequest over
nodeId=om3,nodeAddress=quasar-fzaxrj-8.quasar-fzaxrj.root.hwx.site:9862 after 8
failover attempts. Trying to failover after sleeping for 2000ms.
{code}
The only related config found on the client side is {{ozone.om.client.rpc.timeout}}:
https://github.com/apache/ozone/blob/master/hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/conf/OMClientConfig.java#L42-L53
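
To make the request concrete, here is a minimal sketch of what a bounded failover loop could look like. It mirrors the pattern in the log above (fail over immediately between OM nodes, sleep after a full round), but stops after a configurable number of attempts. Everything in it is an assumption for illustration: {{BoundedFailoverSketch}}, {{RetrySettings}}, {{OmCall}}, {{maxFailoverAttempts}} and {{waitBetweenRoundsMillis}} are hypothetical names, not existing Ozone classes or config keys, and this is not how the Ozone client is actually implemented.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical sketch of bounded OM failover; not Ozone's actual client code. */
public class BoundedFailoverSketch {

    /** Hypothetical settings; the field names are illustrative, not real Ozone config keys. */
    static final class RetrySettings {
        final int maxFailoverAttempts;
        final long waitBetweenRoundsMillis;

        RetrySettings(int maxFailoverAttempts, long waitBetweenRoundsMillis) {
            this.maxFailoverAttempts = maxFailoverAttempts;
            this.waitBetweenRoundsMillis = waitBetweenRoundsMillis;
        }
    }

    /** A call against a single OM node; throws if the node is down or not the leader. */
    interface OmCall<T> {
        T invoke(String nodeId) throws Exception;
    }

    /**
     * Tries the nodes in round-robin order. Fails over immediately within a round,
     * sleeps after each full round, and gives up once the attempt limit is reached.
     */
    static <T> T callWithFailover(List<String> nodeIds, OmCall<T> call, RetrySettings settings)
            throws Exception {
        if (nodeIds.isEmpty()) {
            throw new IllegalArgumentException("At least one OM node is required");
        }
        Exception last = null;
        for (int attempt = 0; attempt < settings.maxFailoverAttempts; attempt++) {
            String nodeId = nodeIds.get(attempt % nodeIds.size());
            try {
                return call.invoke(nodeId);
            } catch (Exception e) {
                last = e; // remember the most recent failure for the final error
            }
            boolean endOfRound = (attempt + 1) % nodeIds.size() == 0;
            boolean attemptsLeft = attempt + 1 < settings.maxFailoverAttempts;
            if (endOfRound && attemptsLeft) {
                Thread.sleep(settings.waitBetweenRoundsMillis);
            }
        }
        throw new IOException(
            "Giving up after " + settings.maxFailoverAttempts + " failover attempts", last);
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("om1", "om2", "om3");
        RetrySettings settings = new RetrySettings(6, 10L);
        try {
            // Every node fails here, so the loop stops after 6 attempts instead of retrying forever.
            callWithFailover(nodes, nodeId -> {
                throw new IOException(nodeId + " unavailable");
            }, settings);
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a limit like this exposed as client config, a CLI call such as the {{getserviceroles}} command above could fail with a clear error once the limit is hit.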