Wei-Chiu Chuang created HDDS-14389:
--------------------------------------
Summary: [Doc] OM HA, SCM HA failover behavior
Key: HDDS-14389
URL: https://issues.apache.org/jira/browse/HDDS-14389
Project: Apache Ozone
Issue Type: Task
Components: documentation
Reporter: Wei-Chiu Chuang
Our OM HA and SCM HA doc covers how multiple OM reaches consensus using Ratis.
However, it misses another important part of story: client failover behavior.
HadoopRpcOMFailoverProxyProvider: if client to OM is Hadoop RPC transport. The
failover or retry may happen if (1) the OM is not not reachable, (2) not a
leader, or (3) is a leader but not ready to accept requests.
The failover will retry up to 500 times (ozone.client.failover.max.attempts),
and 2 seconds between each failover retry
(ozone.client.wait.between.retries.millis). If the OM is not aware of the
current leader, client will try the next OM in round-robin fashion; otherwise,
client will retry contacting the current leader.
GrpcOMFailoverProxyProvider: If client to OM is gRPC transport, the behavior is
largely the same. But I don't have much experience with it so I'll just leave
it as.
Similarly, client (client, OM or Datanode) to SCM failover is controlled by a
series of configuration properties in SCMClientConfig:
hdds.scmclient.rpc.timeout, hdds.scmclient.max.retry.timeout,
hdds.scmclient.failover.max.retry, hdds.scmclient.failover.retry.interval.
Having these behaviors documented will help users troubleshoot problems.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]