[ https://issues.apache.org/jira/browse/HDDS-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kohei Sugihara updated HDDS-7925:
---------------------------------
    Description: 
h2. Overview
 * All S3 requests timed out and stopped responding.
 * No failover to the follower OMs occurred; the deadlock stalled all OMs.
 * Because of the deadlock, no OM API could serve any response.

h2. Environment
 * OM x3 with HA: nodeA, nodeB, nodeC
 * SCM x3 with HA: nodeA, nodeB, nodeC
 * DN x36: node[0:35]
 * S3G x36: node[0:35]

h2. Problem Details

All requests in the IPC Server handlers were stuck in a wait state, waiting either to acquire a bucket lock for reading/writing keys or to get a response from OM Ratis. However, which component holds the lock is unclear because the lock state is spread across all OMs.

The attached archive includes stack traces from each OM taken while the deadlock was occurring. The stack traces were captured ten times at one-second intervals (the one-liner used is attached in a comment). Nodes are assigned the following roles:
 * Leader: nodeB
 * Follower: nodeA, nodeC

In the log file (nodeB.jstack.log.2.0), we can see a jstack result from nodeB (the OM leader) with two kinds of waiting threads. We inspected some of the threads waiting for a bucket lock with the jdb debugger, and all of them were waiting for the lock on the same bucket to be released. However, which thread holds that lock is unclear.
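
As a side note, the grouping we did manually with jstack and jdb can also be scripted against the JVM's own thread MXBean. The sketch below is illustrative only: the class and variable names are invented, it is not part of Ozone, and it would need to run inside the OM JVM (or be adapted to a remote JMX connection). It groups parked threads by the lock reported in their ThreadInfo and prints the owning thread when the JVM tracks it, which at least narrows down who is waiting on what.

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not Ozone code: group waiting threads by the lock they are parked on.
public class LockWaitReport {
  public static void main(String[] args) {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    // true, true: also collect locked monitors and locked ownable synchronizers.
    ThreadInfo[] infos = mx.dumpAllThreads(true, true);

    Map<String, StringBuilder> waitersByLock = new TreeMap<>();
    for (ThreadInfo info : infos) {
      if (info == null || info.getLockName() == null) {
        continue; // thread is not blocked or waiting on a lock
      }
      String owner = info.getLockOwnerName() != null
          ? info.getLockOwnerName() : "<owner not tracked>";
      waitersByLock
          .computeIfAbsent(info.getLockName() + " held by " + owner,
              k -> new StringBuilder())
          .append("  ").append(info.getThreadName())
          .append(" [").append(info.getThreadState()).append("]\n");
    }
    waitersByLock.forEach((lock, waiters) ->
        System.out.println(lock + "\n" + waiters));
  }
}
{code}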
h3. Case #1: Waiting to acquire a bucket lock
{quote}"IPC Server handler 33 on default port 9862" #119 daemon prio=5 os_prio=0 cpu=62093.42ms elapsed=13319.52s allocated=5558M defined_classes=60 tid=0x00007f615f680890 nid=0x3d3b2e waiting on condition  [0x00007f60d2214000]
...
    at org.apache.hadoop.ozone.om.lock.OzoneManagerLock.acquireReadLock(OzoneManagerLock.java:146)
    at org.apache.hadoop.ozone.om.BucketManagerImpl.getBucketInfo(BucketManagerImpl.java:64)
    at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4272)
    at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4290)
    at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4237)
    at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4208)
    at org.apache.hadoop.ozone.om.OzoneManager.lambda$1(OzoneManager.java:2819)
    at org.apache.hadoop.ozone.om.OzoneManager$$Lambda$1096/0x00000008013f2c10.get(Unknown Source)
    at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:36)
    at org.apache.hadoop.ozone.om.OzoneManager.lookupKey(OzoneManager.java:2817)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.lookupKey(OzoneManagerRequestHandler.java:502)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:191)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:226)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:175)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$690/0x0000000801311278.apply(Unknown Source)
    at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:147)
...
{quote}
h3. Case #2: Waiting for Ratis
{quote}"IPC Server handler 35 on default port 9862" #121 daemon prio=5 os_prio=0 cpu=59113.87ms elapsed=13319.52s allocated=5395M defined_classes=7 tid=0x00007f615f6839f0 nid=0x3d3b30 waiting on condition  [0x00007f60d2014000]
...
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:285)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:247)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:217)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:198)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$690/0x0000000801311278.apply(Unknown Source)
    at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
...
{quote}
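
For clarity, the two wait states above can be modeled with plain java.util.concurrent primitives. The sketch below is NOT Ozone code and does not claim to reproduce the actual deadlock (the lock owner is still unknown); all names are invented. It only illustrates where each kind of handler thread parks: Case #1 on a bucket read lock, Case #2 on a pending Ratis reply modeled as a future that never completes.

{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical model of the two observed wait states, not Ozone code.
public class WaitStateModel {
  // Stands in for the per-bucket lock behind OzoneManagerLock.acquireReadLock().
  private final ReentrantReadWriteLock bucketLock = new ReentrantReadWriteLock();
  // Stands in for a Ratis reply that never arrives while the quorum is stuck.
  private final CompletableFuture<String> ratisReply = new CompletableFuture<>();

  // Case #1: a read request parks until the bucket read lock becomes available.
  String lookupKey(String key) {
    bucketLock.readLock().lock();   // parks here while a writer holds the lock
    try {
      return "metadata-for-" + key;
    } finally {
      bucketLock.readLock().unlock();
    }
  }

  // Case #2: a write request parks until the (modeled) Ratis reply completes.
  String submitWrite(String key) throws Exception {
    return ratisReply.get();        // parks here until the future is completed
  }
}
{code}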
h2. Workaround

We found two quick fixes:
 * Reboot all OMs in an appropriate order: when the cluster is stuck, rebooting all OMs starting with the old OM leader recovers from the timeouts, but the problem recurs within a few days.
 * Switch to non-HA: after switching to a non-HA OM, we have not hit the timeout issue.

  was:In our environment, between December 2022 and January 2023, we hit a timeout problem several times: all OMs stopped responding to OM RPCs such as listing keys, getting keys, and getting OM roles. In every case, rebooting the OMs in the appropriate order recovered the service, but the problem recurred within a few days. We dug through the logs and traces of all OMs and noticed that the IPC Server handlers in every OM were full of OM requests waiting for either OzoneManagerLock or OM Ratis. After switching the OMs to non-HA and running for two weeks, the problem has not recurred.


> Potential deadlocks among all OMs in OM HA
> ------------------------------------------
>
>                 Key: HDDS-7925
>                 URL: https://issues.apache.org/jira/browse/HDDS-7925
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM HA
>    Affects Versions: 1.3.0
>         Environment: Configuration: FSO enabled, OM HA, SCM HA
>            Reporter: Kohei Sugihara
>            Priority: Major
>         Attachments: HDDS-7925-logs-2023-01-27.tar.gz
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
