[
https://issues.apache.org/jira/browse/HDDS-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718902#comment-17718902
]
Sumit Agrawal commented on HDDS-8366:
-------------------------------------
[~wanghongbing] There is a similar issue,
https://issues.apache.org/jira/browse/HDDS-7925, where this situation can happen
because the lock was not released properly when getting the bucket info failed.
In this case, as observed from the stack, applyTransaction is waiting for the
write lock and is stuck, so the other threads in submitToRatis are blocked.
Is there any error log or failure observed in the environment (if available)
that would confirm whether this is the same issue?
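
To illustrate the failure mode I mean, here is a minimal standalone sketch. It
is not the actual Ozone code or the HDDS-7925 patch: the class, the methods,
and the simulated bucket lookup are all hypothetical, and a plain
ReentrantReadWriteLock stands in for the OM lock. It only shows how a lock that
is not released on a failure path keeps any later write-lock request from
succeeding, and how try/finally avoids that.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch, not Ozone code: a plain JDK lock stands in for the OM lock.
class BucketLockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Hypothetical stand-in for a bucket info lookup that may fail.
    private String loadBucketInfo(boolean fail) {
        if (fail) {
            throw new IllegalStateException("bucket info unavailable");
        }
        return "bucket-info";
    }

    // Leaky variant: if loadBucketInfo throws, unlock() is never reached.
    public String leakyRead(boolean fail) {
        lock.readLock().lock();
        String info = loadBucketInfo(fail);
        lock.readLock().unlock();
        return info;
    }

    // Safe variant: the read lock is released on every path.
    public String safeRead(boolean fail) {
        lock.readLock().lock();
        try {
            return loadBucketInfo(fail);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Probe: can a writer get the write lock right now?
    // A real writer calling lock() instead of tryLock() would wait here forever.
    public boolean writerCanProceed() {
        if (lock.writeLock().tryLock()) {
            lock.writeLock().unlock();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        BucketLockSketch leaky = new BucketLockSketch();
        try {
            leaky.leakyRead(true);
        } catch (IllegalStateException expected) {
            // The failure itself is handled, but the read lock is still held.
        }
        System.out.println("writer after leaky failure: " + leaky.writerCanProceed());

        BucketLockSketch safe = new BucketLockSketch();
        try {
            safe.safeRead(true);
        } catch (IllegalStateException expected) {
            // Same failure, but the finally block released the lock.
        }
        System.out.println("writer after safe failure: " + safe.writerCanProceed());
    }
}
```

In the leaky variant the write lock can never be acquired after the failure,
which would match an applyTransaction thread parked on the write lock while
the RPC handlers wait on their futures.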
> OzoneManager hangs when submitRequestToRatis
> --------------------------------------------
>
> Key: HDDS-8366
> URL: https://issues.apache.org/jira/browse/HDDS-8366
> Project: Apache Ozone
> Issue Type: Bug
> Components: OM, Ozone Manager
> Affects Versions: 1.3.0
> Reporter: Hongbing Wang
> Assignee: Sumit Agrawal
> Priority: Critical
> Attachments: om.abnormal.jstack, om.normal.jstack, om_rpc_callqueue_
> accumulation.png
>
>
> All OM RPC handlers hang when calling
> `OzoneManagerRatisServer#submitRequestToRatis`. The key stack is as follows:
> {noformat}
> "IPC Server handler 99 on 9862" #187 daemon prio=5 os_prio=0 tid=0x00007f1897b4c000 nid=0x10fa63 waiting on condition [0x00007f05a5b48000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00007f08a185e050> (a java.util.concurrent.CompletableFuture$Signaller)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:285)
> at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:247)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:217)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:198)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$696/251832800.apply(Unknown Source)
> at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:147)
> at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2716)
> Locked ownable synchronizers:
> - None
> {noformat}
> For the complete abnormal stack, see [^om.abnormal.jstack] (also available as a
> [web link|https://github.com/whbing/issue_logs/blob/main/ozone/omrpc20230323/om.abnormal.jstack]).
> For comparison, the normal stack is in [^om.normal.jstack] (also available as a
> [web link|https://github.com/whbing/issue_logs/blob/main/ozone/omrpc20230323/om.normal.jstack]).
> The IPC debug log is as follows:
> {noformat}
> 2023-03-22 13:17:56,135 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: Successfully authorized userInfo {
> effectiveUser: "xxx"
> }
> protocol: "org.apache.hadoop.hdds.protocol.GenericRefreshProtocol"
> 2023-03-22 13:17:56,135 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: got #0
> 2023-03-22 13:17:57,143 [IPC Server idle connection scanner for port 9862] DEBUG org.apache.hadoop.ipc.Server: IPC Server idle connection scanner for port 9862: task running
> 2023-03-22 13:17:57,946 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: got #-4
> 2023-03-22 13:17:57,946 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: Received ping message
> 2023-03-22 13:18:07,143 [IPC Server idle connection scanner for port 9862] DEBUG org.apache.hadoop.ipc.Server: IPC Server idle connection scanner for port 9862: task running
> 2023-03-22 13:18:13,536 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: got #-4
> 2023-03-22 13:18:13,536 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: Received ping message
> {noformat}
> RPCs are backlogged in callQueue:
> !om_rpc_callqueue_ accumulation.png!
>