[ https://issues.apache.org/jira/browse/HDDS-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689299#comment-17689299 ]

Duong commented on HDDS-7925:
-----------------------------

Thanks for the logs, [~ksugihara]. We see this error in the logs only for the 
thread named "IPC Server handler 94 on default port 9862".

 
{code:java}
2023-01-27 13:12:17,863 [IPC Server handler 94 on default port 9862] WARN 
org.apache.hadoop.ipc.Server: IPC Server handler 94 on default port 9862, call 
Call#3631736 Retry#0 
org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 
<ip-for-node30>:38821
java.lang.RuntimeException: Thread 'IPC Server handler 94 on default port 9862' 
cannot acquire VOLUME_LOCK lock while holding [BUCKET_LOCK] lock(s).
    at 
org.apache.hadoop.ozone.om.lock.OzoneManagerLock.lock(OzoneManagerLock.java:185)
    at 
org.apache.hadoop.ozone.om.lock.OzoneManagerLock.acquireReadLock(OzoneManagerLock.java:146)
    at 
org.apache.hadoop.ozone.om.OzoneManager.getVolumeOwner(OzoneManager.java:2446)
    at 
org.apache.hadoop.ozone.om.OzoneManager.getVolumeOwner(OzoneManager.java:2440)
    at org.apache.hadoop.ozone.om.OzoneManager.checkAcls(OzoneManager.java:2395)
    at 
org.apache.hadoop.ozone.om.OzoneManager.getVolumeInfo(OzoneManager.java:2610)
    at 
org.apache.hadoop.ozone.om.OzoneManager.getS3VolumeContext(OzoneManager.java:3493)
    at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.getS3VolumeContext(OzoneManagerRequestHandler.java:1215)
    at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:262)
    at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:226)
    at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:175)
    at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
    at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:147)
....{code}
It indicates that this thread "IPC Server handler 94 on default port 9862" 
somehow "forgot" to release the BUCKET lock in one of its previous operations.
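For reference, here is a minimal standalone sketch (illustrative only, not the actual OzoneManagerLock code) of how such a lock-ordering guard ends up throwing the error above: resources are ranked, a thread may not acquire a lock that ranks at or below one it already holds, and a BUCKET lock that was never released therefore rejects every later VOLUME lock acquisition on that thread. The class and method names here are made up for the example.
{code:java}
import java.util.EnumSet;

// Illustrative lock-ordering guard: VOLUME_LOCK ranks before BUCKET_LOCK,
// so a thread still holding BUCKET_LOCK may not acquire VOLUME_LOCK.
public class LockOrderGuardSketch {
  enum Resource { VOLUME_LOCK, BUCKET_LOCK }

  private static final ThreadLocal<EnumSet<Resource>> HELD =
      ThreadLocal.withInitial(() -> EnumSet.noneOf(Resource.class));

  static void acquire(Resource r) {
    for (Resource h : HELD.get()) {
      if (h.ordinal() >= r.ordinal()) {
        throw new RuntimeException("Thread '" + Thread.currentThread().getName()
            + "' cannot acquire " + r + " lock while holding [" + h + "] lock(s).");
      }
    }
    HELD.get().add(r);   // the real read/write lock would be taken here
  }

  static void release(Resource r) {
    HELD.get().remove(r);  // if a handler skips this, the thread "remembers" the lock
  }

  public static void main(String[] args) {
    acquire(Resource.BUCKET_LOCK);
    // release(Resource.BUCKET_LOCK) was "forgotten" by a previous request, so the
    // next request served by this handler thread fails here with the error above:
    acquire(Resource.VOLUME_LOCK);
  }
}
{code}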

We recently detected and fixed a similar issue in HDDS-7755, which was found 
when we tested in a multi-tenancy context.
Question: did you run the cluster in a multi-tenancy setup? If so, did you 
encounter similar errors like "bucket not found" for read ops?

 

However, looking at your stacks, many read operations are waiting for the BUCKET 
lock; a leaked read lock would not block other readers, hence the leaked one must 
be a Write (BUCKET WRITE) lock.
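A standalone demonstration with java.util.concurrent.locks.ReentrantReadWriteLock (analogous to the read/write lock behind the OM bucket lock; the demo class itself is made up) shows why a single unreleased write lock is enough to stall every new reader:
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustration only: a leaked read lock would not block other readers, but one
// leaked write lock blocks every subsequent reader, which matches the many stuck
// read handlers in the attached jstack output.
public class LeakedWriteLockDemo {
  public static void main(String[] args) throws InterruptedException {
    ReentrantReadWriteLock bucketLock = new ReentrantReadWriteLock();

    // Simulate a handler that acquired the write lock and "forgot" to release it.
    bucketLock.writeLock().lock();

    Thread reader = new Thread(() -> {
      System.out.println("reader: waiting for read lock...");
      bucketLock.readLock().lock();           // blocks forever, like the IPC handlers
      System.out.println("reader: acquired"); // never printed
    });
    reader.setDaemon(true);
    reader.start();

    reader.join(2000);
    System.out.println("reader still blocked: " + reader.isAlive());
  }
}
{code}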

With that assumption, I'd suggest looking at the OM audit logs. My bet is that 
there is some failing write request shortly before the deadlock, and it would be 
a good indicator. [~ksugihara] 
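If it helps, something like the rough filter below could narrow the audit logs down. The class name, the default file name, and the "ret=FAILURE" token are my assumptions about the default OM audit log format, so adjust them to whatever your logs actually contain:
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Rough audit-log filter: print entries that look like failed operations so they
// can be correlated with the time the handlers got stuck.
public class FailedAuditOps {
  public static void main(String[] args) throws IOException {
    String auditFile = args.length > 0 ? args[0] : "om-audit.log"; // hypothetical path
    try (Stream<String> lines = Files.lines(Paths.get(auditFile))) {
      lines.filter(line -> line.contains("ret=FAILURE"))  // assumed failure marker
           .forEach(System.out::println);
    }
  }
}
{code}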
I know that all the acquired locks are released in try-catch-finally blocks, but 
the evidence strongly suggests a leak. Somewhere, the code fails to release a 
bucket (write) lock; it could be because releasing the write lock is not the 
first statement in the finally block, though that is just a wild guess.
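To make that guess concrete, here is a hypothetical leak pattern (not taken from any specific Ozone method): the unlock does sit in a finally block, but it is not the first statement there, so an exception from the preceding cleanup step skips the release and the write lock stays held by the handler thread.
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical pattern: if anything placed before the unlock call in the finally
// block can throw, the write lock leaks even though "the lock is released in a
// finally block".
public class FinallyOrderLeak {
  private static final ReentrantReadWriteLock BUCKET_LOCK = new ReentrantReadWriteLock();

  static void handleWriteRequest() {
    BUCKET_LOCK.writeLock().lock();
    try {
      throw new RuntimeException("request fails");    // some failing write request
    } finally {
      cleanup();                                       // throws -> next line never runs
      BUCKET_LOCK.writeLock().unlock();                // leaked
    }
  }

  static void cleanup() {
    throw new IllegalStateException("cleanup fails too");
  }

  public static void main(String[] args) {
    try {
      handleWriteRequest();
    } catch (RuntimeException ignored) {
      // the handler thread survives and returns to the pool, but the lock is held
    }
    System.out.println("write lock still held: " + BUCKET_LOCK.isWriteLocked());
  }
}
{code}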

> Deadlocks among all OMs in OM HA
> --------------------------------
>
>                 Key: HDDS-7925
>                 URL: https://issues.apache.org/jira/browse/HDDS-7925
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM HA
>    Affects Versions: 1.3.0
>         Environment: Configuration: FSO enabled, OM HA, SCM HA
>            Reporter: Kohei Sugihara
>            Assignee: Duong
>            Priority: Critical
>         Attachments: HDDS-7925-logs-2023-01-27.tar.gz, om-logs.2.tar.gz
>
>
> h2. Overview
>  * All S3 requests timed out and were not responding.
>  * No failover to the follower OMs occurs; the deadlock stalls all OMs.
>  * Because of the deadlock, no OM API could serve any response.
>  * Switching to non-HA does not reproduce the deadlock.
> h2. Environment
>  * OM x3 with HA: nodeA, nodeB, nodeC
>  * SCM x3 with HA: nodeA, nodeB, nodeC
>  * DN x36: node[0:35]
>  * S3G x36: node[0:35]
> h2. Problem Details
> All requests in the IPC Server handlers were stuck in a wait state, waiting 
> either to acquire a bucket lock for reading/writing keys or to get a response 
> from OM Ratis. However, it is unclear which component holds the lock, because 
> the locking is spread across all OMs.
> The attached archive includes stack traces from each OM when the deadlock 
> occurred. The stack traces were taken ten times at one-second intervals (the 
> one-liner used is attached in a comment below). OM nodes are assigned the 
> following roles:
>  * Leader: nodeB
>  * Follower: nodeA, nodeC
> In the log file (nodeB.jstack.log.2.0), we can see the jstack result for nodeB 
> (the OM leader) and two kinds of waiting threads. We checked some of the threads 
> waiting for a bucket lock with the jdb debugger, and all of them are waiting for 
> the lock on the same bucket to be freed. However, it is unclear who acquired 
> that lock.
> h3. Case #1: Waiting to acquire a bucket lock
> {quote}"IPC Server handler 33 on default port 9862" #119 daemon prio=5 
> os_prio=0 cpu=62093.42ms elapsed=13319.52s allocated=5558M defined_classes=60 
> tid=0x00007f615f680890 nid=0x3d3b2e waiting on condition  [0x00007f60d2214000]
> ...
>     at 
> org.apache.hadoop.ozone.om.lock.OzoneManagerLock.acquireReadLock(OzoneManagerLock.java:146)
>     at 
> org.apache.hadoop.ozone.om.BucketManagerImpl.getBucketInfo(BucketManagerImpl.java:64)
>     at 
> org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4272)
>     at 
> org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4290)
>     at 
> org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4237)
>     at 
> org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4208)
>     at 
> org.apache.hadoop.ozone.om.OzoneManager.lambda$1(OzoneManager.java:2819)
>     at 
> org.apache.hadoop.ozone.om.OzoneManager$$Lambda$1096/0x00000008013f2c10.get(Unknown
>  Source)
>     at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:36)
>     at 
> org.apache.hadoop.ozone.om.OzoneManager.lookupKey(OzoneManager.java:2817)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.lookupKey(OzoneManagerRequestHandler.java:502)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:191)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:226)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:175)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$690/0x0000000801311278.apply(Unknown
>  Source)
>     at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:147)
> ...
> {quote}
> h3. Case #2: Waiting for Ratis
> {quote}"IPC Server handler 35 on default port 9862" #121 daemon prio=5 
> os_prio=0 cpu=59113.87ms elapsed=13319.52s allocated=5395M defined_classes=7 
> tid=0x00007f615f6839f0 nid=0x3d3b30 waiting on condition  [0x00007f60d2014000]
> ...
>     at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:285)
>     at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:247)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:217)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:198)
>     at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$690/0x0000000801311278.apply(Unknown
>  Source)
>     at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> ...
> {quote}
> h2. Workaround
> We found two quick fixes:
>  * Reboot all OMs in an appropriate order: when stuck, rebooting all OMs 
> starting with the old OM leader recovers from the timeouts, but the issue 
> recurs within a few days.
>  * Switch to non-HA: after switching to non-HA, we have not encountered the 
> timeout issue.


