[ 
https://issues.apache.org/jira/browse/HDDS-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gargi Jaiswal reassigned HDDS-14389:
------------------------------------

    Assignee: Gargi Jaiswal

> [Website v2] [Docs] [Administrator Guide] OM HA, SCM HA failover behavior
> -------------------------------------------------------------------------
>
>                 Key: HDDS-14389
>                 URL: https://issues.apache.org/jira/browse/HDDS-14389
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: documentation
>            Reporter: Wei-Chiu Chuang
>            Assignee: Gargi Jaiswal
>            Priority: Major
>
> Our OM HA and SCM HA doc covers how multiple OM reaches consensus using Ratis.
> However, it misses another important part of story: client failover behavior.
> HadoopRpcOMFailoverProxyProvider: if client to OM is Hadoop RPC transport. 
> The failover or retry may happen if (1) the OM is not not reachable, (2) not 
> a leader, or (3) is a leader but not ready to accept requests.
> The failover will retry up to 500 times (ozone.client.failover.max.attempts), 
> and 2 seconds between each failover retry 
> (ozone.client.wait.between.retries.millis). If the OM is not aware of the 
> current leader, client will try the next OM in round-robin fashion; 
> otherwise, client will retry contacting the current leader.
> Additionally, it is crucial to ensure clients and OM have consistent node 
> mapping configurations, otherwise failover may not reach the leader OM.
> GrpcOMFailoverProxyProvider: If client to OM is gRPC transport, the behavior 
> is largely the same. But I don't have much experience with it so I'll just 
> leave it as.
> Similarly, client (client, OM or Datanode) to SCM failover is controlled by a 
> series of configuration properties in SCMClientConfig: 
> hdds.scmclient.rpc.timeout, hdds.scmclient.max.retry.timeout, 
> hdds.scmclient.failover.max.retry, hdds.scmclient.failover.retry.interval.
> Having these behaviors documented will help users troubleshoot problems.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to