[ https://issues.apache.org/jira/browse/HDDS-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aleksei Ieshin updated HDDS-14212:
----------------------------------
    Description: 
Our Ozone cluster runs on Kubernetes; we have a number of Kubernetes nodes, and 
each node runs one S3G and one DN. Some nodes additionally run one OM or one 
SCM instance. We have three OMs: om0, om1, and om2.

For some reason, the Kubernetes node running S3G, DN, and *om1* (the leader) 
went into a non-Ready state for a few minutes (om1 was still running but did 
not serve any traffic). That caused *om2* to take over the leadership. A few 
seconds later, *om1* returned to the cluster.
All S3Gs failed over to the new OM leader except one, which got stuck retrying 
the failover. Restarting that failing S3G resolved the issue.
h4. Investigation

Later, the investigation showed the following:
 # The cluster had very low (non-default) settings that made it exhaust its 
failover limit quickly:

```
"ozone.client.wait.between.retries.millis": "250"
"ozone.client.failover.max.attempts": "16"
```

 # {{hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/GrpcOmTransport.java}} 
line 93: {{private int failoverCount = 0;}} - all threads share this counter, 
and it never resets.
 * Also, in {{GrpcOmTransport.shouldRetry}} (line 258) we run {{action = 
retryPolicy.shouldRetry((Exception) ex, 0, failoverCount++, true);}} Is that 
intentional? Is it safe to do that?
 * Next, in {{OMFailoverProxyProviderBase.getRetryPolicy.getRetryAction}}, we 
still use that global {{failoverCount}}: once we have reached {{maxFailovers}}, 
the check {{if (failovers < maxFailovers)}} (line 258) always falls through to 
{{return RetryAction.FAIL;}} (line 263) (see the sketch after this list).
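
With those settings the total failover budget is roughly 16 attempts with 250 ms 
between retries, i.e. about 4 seconds, and because the counter is shared by all 
threads and never reset, later requests inherit an already-exhausted budget. Below 
is a minimal, self-contained sketch of that failure mode (illustrative only; the 
class and method names are made up and are not the actual Ozone code):

```
// Illustrative model of a transport whose failover counter is shared by every
// request and never reset, similar in spirit to the pattern described above.
public class SharedFailoverCounterDemo {

  static final int MAX_FAILOVERS = 16;           // ozone.client.failover.max.attempts
  static final long WAIT_BETWEEN_RETRIES = 250L; // ozone.client.wait.between.retries.millis

  // One counter for the whole transport instance, shared by all requests/threads.
  private int failoverCount = 0;

  /** Returns true if another failover is allowed, false once the budget is spent. */
  synchronized boolean shouldRetry() {
    // Plays the role of "if (failovers < maxFailovers) ... else RetryAction.FAIL".
    return failoverCount++ < MAX_FAILOVERS;
  }

  public static void main(String[] args) throws InterruptedException {
    SharedFailoverCounterDemo transport = new SharedFailoverCounterDemo();

    // A brief leader outage: the requests running at that moment burn the shared budget.
    int allowed = 0;
    while (transport.shouldRetry()) {
      allowed++;
      Thread.sleep(WAIT_BETWEEN_RETRIES);
    }
    System.out.println("Failovers allowed during the outage: " + allowed); // 16, ~4 s

    // The leader is back, but the counter was never reset, so every new request
    // is told to fail immediately.
    System.out.println("New request may fail over? " + transport.shouldRetry()); // false
  }
}
```

If that model matches the real behaviour, it would also explain why only restarting 
the S3G (which recreates the transport and its counter) cleared the condition.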

The proposal is to track {{failoverCount}} per request instead of keeping it as 
a single global counter.
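
A minimal sketch of that direction, under the assumption that the retry loop can 
own the counter (names are hypothetical; this is not a patch against 
{{GrpcOmTransport}}):

```
// Illustrative sketch of a per-request failover budget: the counter lives in a
// local variable of the request loop, so every request starts from zero.
public class PerRequestFailoverDemo {

  static final int MAX_FAILOVERS = 16;

  interface Call<T> { T run() throws Exception; }

  static <T> T submitWithRetry(Call<T> call) throws Exception {
    int failoverCount = 0; // scoped to this request, never shared across threads
    while (true) {
      try {
        return call.run();
      } catch (Exception e) {
        if (failoverCount++ >= MAX_FAILOVERS) {
          throw e; // only *this* request's budget is exhausted
        }
        // here the real code would fail over to the next OM and wait before retrying
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Even if one request exhausts its budget, the next one starts fresh.
    System.out.println(submitWithRetry(() -> "ok"));
  }
}
```

An alternative with a similar effect would be to reset the shared counter after a 
successful failover; either way, a single stuck request could no longer poison 
every subsequent request on the same S3G.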

 

More details here: [https://github.com/apache/ozone/discussions/9477]

 


> S3G stuck on failover to a new leader OM
> ----------------------------------------
>
>                 Key: HDDS-14212
>                 URL: https://issues.apache.org/jira/browse/HDDS-14212
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: S3, s3gateway
>            Reporter: Aleksei Ieshin
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
