[jira] [Created] (HDDS-14389) [Doc] OM HA, SCM HA failover behavior

Wei-Chiu Chuang (Jira) Thu, 08 Jan 2026 22:55:08 -0800

Wei-Chiu Chuang created HDDS-14389:
--------------------------------------

             Summary: [Doc] OM HA, SCM HA failover behavior
                 Key: HDDS-14389
                 URL: https://issues.apache.org/jira/browse/HDDS-14389
             Project: Apache Ozone
          Issue Type: Task
          Components: documentation
            Reporter: Wei-Chiu Chuang



Our OM HA and SCM HA doc covers how multiple OM reaches consensus using Ratis.

However, it misses another important part of story: client failover behavior.

HadoopRpcOMFailoverProxyProvider: if client to OM is Hadoop RPC transport. The 
failover or retry may happen if (1) the OM is not not reachable, (2) not a 
leader, or (3) is a leader but not ready to accept requests.

The failover will retry up to 500 times (ozone.client.failover.max.attempts), 
and 2 seconds between each failover retry 
(ozone.client.wait.between.retries.millis). If the OM is not aware of the 
current leader, client will try the next OM in round-robin fashion; otherwise, 
client will retry contacting the current leader.

GrpcOMFailoverProxyProvider: If client to OM is gRPC transport, the behavior is 
largely the same. But I don't have much experience with it so I'll just leave 
it as.

Similarly, client (client, OM or Datanode) to SCM failover is controlled by a 
series of configuration properties in SCMClientConfig: 
hdds.scmclient.rpc.timeout, hdds.scmclient.max.retry.timeout, 
hdds.scmclient.failover.max.retry, hdds.scmclient.failover.retry.interval.

Having these behaviors documented will help users troubleshoot problems.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Created] (HDDS-14389) [Doc] OM HA, SCM HA failover behavior

Reply via email to