[ https://issues.apache.org/jira/browse/HDDS-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689059#comment-17689059 ]

Kohei Sugihara commented on HDDS-7925:
--------------------------------------

Hi [~duongnguyen], thank you for your comment. Here are the OM logs taken from 
nodeA, nodeB (the leader), and nodeC.
[^om-logs.2.tar.gz]

A summary of the log files:
 * The deadlock happened at around 13:15
 * We rebooted all OM processes at around 13:49 to resolve the deadlock (you 
can jump to L5001 to see it):
 ** nodeA: 2023-01-27 13:49:20,079
 ** nodeB: 2023-01-27 13:49:56,139
 ** nodeC: 2023-01-27 13:50:12,234
 * Each log file includes 5000 lines up to the reboot, covering the time the 
deadlock happened (13:15), and 2000 lines from the restart

Interestingly, only a few exceptions or ERROR entries were logged before the 
deadlock (see the grep output below). I've also inspected every call path that 
reaches OzoneManager#release(Read|Write)Lock, and all of them are coded within 
try-catch-finally blocks, so I think the locks are released properly even if an 
exception is raised. What do you think?
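
For completeness, this is the shape of the pattern I checked for on those call 
paths. It is only an illustrative sketch with placeholder names 
(LockPatternSketch, readWithBucketLock, lock, volumeName, bucketName), assuming 
the OzoneManagerLock API shape visible in the stack traces (a Resource constant 
plus volume/bucket name arguments); it is not code copied from OM:

{code:java}
import org.apache.hadoop.ozone.om.lock.OzoneManagerLock;
import static org.apache.hadoop.ozone.om.lock.OzoneManagerLock.Resource.BUCKET_LOCK;

class LockPatternSketch {
  // Sketch only: acquire before the try, release in the finally, so the lock
  // is freed even when the guarded block throws.
  void readWithBucketLock(OzoneManagerLock lock, String volumeName, String bucketName) {
    lock.acquireReadLock(BUCKET_LOCK, volumeName, bucketName);
    try {
      // ... read bucket/key metadata ...
    } finally {
      // always reached, whether or not an exception was raised above
      lock.releaseReadLock(BUCKET_LOCK, volumeName, bucketName);
    }
  }
}
{code}

If any path acquired a lock outside such a block, a leaked lock could explain 
the hang, but I could not find one.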

 
{code:java}
% grep ERROR nodeB.om.2.log
2023-01-27 12:23:26,212 [OM StateMachine ApplyTransaction Thread - 0] ERROR 
org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO: Key delete 
failed. Volume:<redacted> Bucket:<redacted> Key:<redacted>
2023-01-27 13:12:17,862 [IPC Server handler 94 on default port 9862] ERROR 
org.apache.hadoop.ozone.om.lock.OzoneManagerLock: Thread 'IPC Server handler 94 
on default port 9862' cannot acquire VOLUME_LOCK lock while holding 
[BUCKET_LOCK] lock(s).
2023-01-27 13:12:18,586 [IPC Server handler 94 on default port 9862] ERROR 
org.apache.hadoop.ozone.om.lock.OzoneManagerLock: Thread 'IPC Server handler 94 
on default port 9862' cannot acquire VOLUME_LOCK lock while holding 
[BUCKET_LOCK] lock(s).
2023-01-27 13:12:20,450 [IPC Server handler 94 on default port 9862] ERROR 
org.apache.hadoop.ozone.om.lock.OzoneManagerLock: Thread 'IPC Server handler 94 
on default port 9862' cannot acquire VOLUME_LOCK lock while holding 
[BUCKET_LOCK] lock(s).
2023-01-27 13:12:20,962 [IPC Server handler 94 on default port 9862] ERROR 
org.apache.hadoop.ozone.om.lock.OzoneManagerLock: Thread 'IPC Server handler 94 
on default port 9862' cannot acquire VOLUME_LOCK lock while holding 
[BUCKET_LOCK] lock(s).
2023-01-27 13:12:21,614 [IPC Server handler 94 on default port 9862] ERROR 
org.apache.hadoop.ozone.om.lock.OzoneManagerLock: Thread 'IPC Server handler 94 
on default port 9862' cannot acquire VOLUME_LOCK lock while holding 
[BUCKET_LOCK] lock(s).
2023-01-27 13:12:22,490 [IPC Server handler 94 on default port 9862] ERROR 
org.apache.hadoop.ozone.om.lock.OzoneManagerLock: Thread 'IPC Server handler 94 
on default port 9862' cannot acquire VOLUME_LOCK lock while holding 
[BUCKET_LOCK] lock(s).
2023-01-27 13:12:23,203 [IPC Server handler 94 on default port 9862] ERROR 
org.apache.hadoop.ozone.om.lock.OzoneManagerLock: Thread 'IPC Server handler 94 
on default port 9862' cannot acquire VOLUME_LOCK lock while holding 
[BUCKET_LOCK] lock(s).
2023-01-27 13:12:25,116 [IPC Server handler 94 on default port 9862] ERROR 
org.apache.hadoop.ozone.om.lock.OzoneManagerLock: Thread 'IPC Server handler 94 
on default port 9862' cannot acquire VOLUME_LOCK lock while holding 
[BUCKET_LOCK] lock(s).
2023-01-27 13:12:25,577 [IPC Server handler 94 on default port 9862] ERROR 
org.apache.hadoop.ozone.om.lock.OzoneManagerLock: Thread 'IPC Server handler 94 
on default port 9862' cannot acquire VOLUME_LOCK lock while holding 
[BUCKET_LOCK] lock(s).{code}
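
A side note on those ERROR lines: they come from OzoneManagerLock's 
lock-ordering check, which logs and rejects an attempt to take VOLUME_LOCK 
while the same thread already holds BUCKET_LOCK. Below is a minimal sketch of 
the acquisition order that would produce that message (same assumed API and 
placeholder names as the sketch above, not taken from the OM code):

{code:java}
// Sketch only: requesting VOLUME_LOCK while BUCKET_LOCK is already held is
// flagged by OzoneManagerLock with the "cannot acquire VOLUME_LOCK lock while
// holding [BUCKET_LOCK] lock(s)" ERROR seen in the grep output.
lock.acquireReadLock(BUCKET_LOCK, volumeName, bucketName);
try {
  lock.acquireReadLock(VOLUME_LOCK, volumeName); // rejected by the ordering check
} finally {
  lock.releaseReadLock(BUCKET_LOCK, volumeName, bucketName);
}
{code}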
 

 

> Deadlocks among all OMs in OM HA
> --------------------------------
>
>                 Key: HDDS-7925
>                 URL: https://issues.apache.org/jira/browse/HDDS-7925
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM HA
>    Affects Versions: 1.3.0
>         Environment: Configuration: FSO enabled, OM HA, SCM HA
>            Reporter: Kohei Sugihara
>            Assignee: Duong
>            Priority: Critical
>         Attachments: HDDS-7925-logs-2023-01-27.tar.gz, om-logs.2.tar.gz
>
>
> h2. Overview
>  * All S3 requests timed out and got no responses.
>  * No failover to the follower OMs occurred; the deadlock stuck all OMs.
>  * Because of the deadlock, none of the OM APIs could serve any responses.
>  * The deadlock does not recur after switching to non-HA.
> h2. Environment
>  * OM x3 with HA: nodeA, nodeB, nodeC
>  * SCM x3 with HA: nodeA, nodeB, nodeC
>  * DN x36: node[0:35]
>  * S3G x36: node[0:35]
> h2. Problem Details
> All requests in the IPC Server handlers were stuck in a wait state, waiting 
> either to acquire a bucket lock for reading/writing keys or to get a response 
> from OM Ratis. However, which component holds the lock is unclear because the 
> locking is spread across all OMs. 
> The attached archive includes stack traces for each OM at the time the 
> deadlock occurred. The stack traces were taken ten times at one-second 
> intervals (the one-liner is attached in a comment below). OM nodes are 
> assigned the following roles:
>  * Leader: nodeB
>  * Follower: nodeA, nodeC
> In the log file (nodeB.jstack.log.2.0), we can see a jstack result from nodeB 
> (the OM leader) with two kinds of waiting threads. We checked some of the 
> threads waiting for a bucket lock with the jdb debugger, and all of them are 
> waiting for a lock on the same bucket to be freed. However, which thread 
> acquired that lock is unclear.
> h3. Case #1: Waiting to acquire a bucket lock
> {quote}"IPC Server handler 33 on default port 9862" #119 daemon prio=5 
> os_prio=0 cpu=62093.42ms elapsed=13319.52s allocated=5558M defined_classes=60 
> tid=0x00007f615f680890 nid=0x3d3b2e waiting on condition  [0x00007f60d2214000]
> ...
>     at 
> org.apache.hadoop.ozone.om.lock.OzoneManagerLock.acquireReadLock(OzoneManagerLock.java:146)
>     at 
> org.apache.hadoop.ozone.om.BucketManagerImpl.getBucketInfo(BucketManagerImpl.java:64)
>     at 
> org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4272)
>     at 
> org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4290)
>     at 
> org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4237)
>     at 
> org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4208)
>     at 
> org.apache.hadoop.ozone.om.OzoneManager.lambda$1(OzoneManager.java:2819)
>     at 
> org.apache.hadoop.ozone.om.OzoneManager$$Lambda$1096/0x00000008013f2c10.get(Unknown
>  Source)
>     at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:36)
>     at 
> org.apache.hadoop.ozone.om.OzoneManager.lookupKey(OzoneManager.java:2817)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.lookupKey(OzoneManagerRequestHandler.java:502)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:191)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:226)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:175)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$690/0x0000000801311278.apply(Unknown
>  Source)
>     at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:147)
> ...
> {quote}
> h3. Case #2: Waiting for Ratis
> {quote}"IPC Server handler 35 on default port 9862" #121 daemon prio=5 
> os_prio=0 cpu=59113.87ms elapsed=13319.52s allocated=5395M defined_classes=7 
> tid=0x00007f615f6839f0 nid=0x3d3b30 waiting on condition  [0x00007f60d2014000]
> ...
>     at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:285)
>     at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:247)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:217)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:198)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$690/0x0000000801311278.apply(Unknown
>  Source)
>     at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> ...
> {quote}
> h2. Workaround
> We found two quick fixes:
>  * Reboot all OMs in an appropriate order: when stuck, rebooting all OMs 
> starting with the old OM leader recovers from the timeouts, but the deadlock 
> recurs within a few days.
>  * Switch to non-HA: after switching to non-HA, we have not hit the timeout 
> issue again.


