[
https://issues.apache.org/jira/browse/HDDS-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aleksei Ieshin updated HDDS-14212:
----------------------------------
Description:
Our Ozone cluster runs on Kubernetes; we have a number of kube nodes, each
node running one S3G and one DN. Some kube nodes additionally run one OM
or one SCM instance. We have three OMs: om0, om1, and om2.
At some point, the kube node running S3G, DN, and *om1* (the leader) went into
a non-Ready state for a few minutes (om1 was still running but did not serve
any traffic). That caused *om2* to take over the leadership. A few seconds
later, *om1* returned to the cluster.
All S3Gs failed over to the new OM leader except one, which got stuck in that
failover-attempts state. Restarting the failing S3G resolved the issue.
h4. Investigation
Later, the investigation showed the following:
# The cluster had very low (non-default) settings that made it quickly exhaust
its failover budget (16 attempts 250 ms apart is only about 4 seconds of retrying):
{code:java}
"ozone.client.wait.between.retries.millis": "250"
"ozone.client.failover.max.attempts": "16"
{code}
2. In
{{hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/GrpcOmTransport.java}},
line 93: {{private int failoverCount = 0;}} - all threads share this counter,
and it is never reset (illustrated in the sketch after this list).
- Also, in {{GrpcOmTransport.shouldRetry}} (line 258) we run {{action =
retryPolicy.shouldRetry((Exception)ex, 0, failoverCount++, true);}}. Is that
intentional? Is it safe to do?
- Next, in {{OMFailoverProxyProviderBase.getRetryPolicy.getRetryAction}}, we still
use that global {{failoverCount}} in the check {{if (failovers < maxFailovers)}}
(line 258), which always returns {{RetryAction.FAIL}} (line 263) once
{{maxFailovers}} has been reached.
- Shouldn't {{failoverCount}} be tracked per request or per thread instead of
being a single shared counter? Or should it at least be reset? In other words,
the proposal is to make {{failoverCount}} per-request rather than a global flag
(a rough sketch of that alternative follows the discussion link below).
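Below is a minimal, self-contained sketch of the suspected behaviour with the
configuration values above. It is not the actual Ozone code: the class, field,
and method names are simplified stand-ins for {{GrpcOmTransport}} /
{{OMFailoverProxyProviderBase}}, and only the shared-counter logic is modelled.
{code:java}
// Simplified, self-contained model of the suspected problem. This is NOT the
// real GrpcOmTransport / OMFailoverProxyProviderBase code; names and the
// retry model are illustrative stand-ins only.
public class SharedFailoverCounterDemo {

  // The two (non-default) client settings from the investigation above.
  static final int MAX_FAILOVER_ATTEMPTS = 16;      // ozone.client.failover.max.attempts
  static final long WAIT_BETWEEN_RETRIES_MS = 250;  // ozone.client.wait.between.retries.millis

  // One counter for the whole transport instance, shared by every request and
  // never reset -- mirrors "private int failoverCount = 0;" (GrpcOmTransport:93).
  static int failoverCount = 0;

  enum RetryAction { FAILOVER_AND_RETRY, FAIL }

  // Mirrors the shape of the check in getRetryPolicy: once the shared counter
  // reaches the limit, every caller gets FAIL.
  static RetryAction shouldRetry(int failovers) {
    return failovers < MAX_FAILOVER_ATTEMPTS
        ? RetryAction.FAILOVER_AND_RETRY
        : RetryAction.FAIL;
  }

  public static void main(String[] args) {
    // Simulate many independent S3G requests, each hitting a single transient
    // OM leader change. Each request needs only one failover, but because the
    // counter is shared they collectively burn the whole budget.
    for (int request = 1; request <= 20; request++) {
      RetryAction action = shouldRetry(failoverCount++);
      System.out.printf("request %2d -> %s (shared failoverCount=%d)%n",
          request, action, failoverCount);
    }
    // The time budget is also tiny: 16 attempts, 250 ms apart, is ~4 seconds.
    System.out.printf("worst-case retry window ~%d ms%n",
        MAX_FAILOVER_ATTEMPTS * WAIT_BETWEEN_RETRIES_MS);
  }
}
{code}
With a single process-wide counter, the first sixteen failovers anywhere in the
gateway consume the whole budget; after that, every request immediately gets
{{RetryAction.FAIL}} until the process is restarted, which would match the
stuck S3G we observed.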
More details here: [https://github.com/apache/ozone/discussions/9477]
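For the per-request question above, here is an equally rough sketch of what a
per-request counter could look like, purely for illustration (again, not the
real Ozone classes; {{submit()}} and its argument are made up):
{code:java}
// Illustrative only: what a per-request failover counter could look like.
// Not a patch against the real classes; submit() and its argument are made up.
public class PerRequestFailoverDemo {

  static final int MAX_FAILOVER_ATTEMPTS = 16;

  enum RetryAction { FAILOVER_AND_RETRY, FAIL }

  static RetryAction shouldRetry(int failovers) {
    return failovers < MAX_FAILOVER_ATTEMPTS
        ? RetryAction.FAILOVER_AND_RETRY
        : RetryAction.FAIL;
  }

  // The counter lives on the stack of each call, so it starts at zero for
  // every request -- it is effectively "reset for free".
  static boolean submit(int failuresBeforeSuccess) {
    int failoverCount = 0;                      // per-request, not shared
    while (true) {
      boolean failedThisAttempt = failoverCount < failuresBeforeSuccess;
      if (!failedThisAttempt) {
        return true;                            // request eventually succeeds
      }
      if (shouldRetry(failoverCount++) == RetryAction.FAIL) {
        return false;                           // budget exhausted for THIS request only
      }
      // ... in the real client: wait ozone.client.wait.between.retries.millis,
      // then fail over to the next OM ...
    }
  }

  public static void main(String[] args) {
    // Every request gets its own budget, so a burst of leader changes cannot
    // permanently poison the transport for later requests.
    for (int request = 1; request <= 20; request++) {
      System.out.printf("request %2d succeeded: %b%n", request, submit(1));
    }
  }
}
{code}
Keeping the counter local to the call means it starts at zero for every
request, so one request's failovers cannot starve unrelated requests.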
> S3G stuck on failover to a new leader OM
> ----------------------------------------
>
> Key: HDDS-14212
> URL: https://issues.apache.org/jira/browse/HDDS-14212
> Project: Apache Ozone
> Issue Type: Bug
> Components: S3, s3gateway
> Reporter: Aleksei Ieshin
> Priority: Major
>