[
https://issues.apache.org/jira/browse/HDDS-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17708503#comment-17708503
]
Sumit Agrawal commented on HDDS-8366:
-------------------------------------
[~wanghongbing]
From the abnormal stack, requests are piling up while waiting to be executed by Ratis.
One possibility is that request execution is slow, which would cause this backlog.
I cannot find any blocking lock or other information that would explain a block.
Please check whether the test case you are running is an overload test, or whether
replies to the requests you send are coming back slowly.
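
As an illustration only (this is not OM or Ratis code), the sketch below shows how a small pool of handler threads backs up when each one waits on a reply that does not arrive. The quoted stack shows handlers parked in the untimed `CompletableFuture.get()`; the sketch uses the timed overload so that a stalled wait surfaces as a `TimeoutException` instead of parking forever. The class name, method name, and parameters are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

public class HandlerPileUpSketch {

    /**
     * Submits {@code requests} tasks to a fixed pool of {@code handlers} threads.
     * Every task waits on a future that is never completed (a stand-in for a
     * reply that does not arrive), but with a bounded wait.
     * Returns the number of waits that ended in a timeout.
     */
    static int countTimedOutWaits(int handlers, int requests, long timeoutMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(handlers);
        AtomicInteger timedOut = new AtomicInteger();
        // Hypothetical stand-in for a reply future that nobody completes.
        CompletableFuture<String> neverCompleted = new CompletableFuture<>();

        for (int i = 0; i < requests; i++) {
            pool.execute(() -> {
                try {
                    // Timed get: the handler is released after timeoutMillis.
                    neverCompleted.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    timedOut.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (ExecutionException e) {
                    // Not reached here: the future never completes exceptionally.
                }
            });
        }

        // Requests beyond the pool size wait in the executor queue until a
        // handler frees up, similar to a call queue backlog.
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return timedOut.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("timed-out waits: " + countTimedOutWaits(2, 6, 50));
    }
}
```

With an untimed `get()` in place of the timed one, the first two tasks would occupy both handlers indefinitely and the remaining four would never start, which is the kind of accumulation a thread dump would then show.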
> OzoneManager hangs when submitRequestToRatis
> --------------------------------------------
>
> Key: HDDS-8366
> URL: https://issues.apache.org/jira/browse/HDDS-8366
> Project: Apache Ozone
> Issue Type: Bug
> Components: OM, Ozone Manager
> Affects Versions: 1.3.0
> Reporter: Hongbing Wang
> Priority: Major
> Attachments: om.abnormal.jstack, om.normal.jstack, om_rpc_callqueue_
> accumulation.png
>
>
> All OM RPC handlers hang when calling
> `OzoneManagerRatisServer#submitRequestToRatis`. The key stack is as follows:
> {noformat}
> "IPC Server handler 99 on 9862" #187 daemon prio=5 os_prio=0 tid=0x00007f1897b4c000 nid=0x10fa63 waiting on condition [0x00007f05a5b48000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00007f08a185e050> (a java.util.concurrent.CompletableFuture$Signaller)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:285)
> at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:247)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:217)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:198)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$696/251832800.apply(Unknown Source)
> at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:147)
> at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2716)
> Locked ownable synchronizers:
> - None
> {noformat}
> For the complete abnormal stack, see [^om.abnormal.jstack] (also available as a
> [web link|https://github.com/whbing/issue_logs/blob/main/ozone/omrpc20230323/om.abnormal.jstack]).
> For comparison, the normal stack is in [^om.normal.jstack] (also available as a
> [web link|https://github.com/whbing/issue_logs/blob/main/ozone/omrpc20230323/om.normal.jstack]).
> The IPC debug log is as follows:
> {noformat}
> 2023-03-22 13:17:56,135 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: Successfully authorized userInfo {
> effectiveUser: "xxx"
> }
> protocol: "org.apache.hadoop.hdds.protocol.GenericRefreshProtocol"
> 2023-03-22 13:17:56,135 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: got #0
> 2023-03-22 13:17:57,143 [IPC Server idle connection scanner for port 9862] DEBUG org.apache.hadoop.ipc.Server: IPC Server idle connection scanner for port 9862: task running
> 2023-03-22 13:17:57,946 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: got #-4
> 2023-03-22 13:17:57,946 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: Received ping message
> 2023-03-22 13:18:07,143 [IPC Server idle connection scanner for port 9862] DEBUG org.apache.hadoop.ipc.Server: IPC Server idle connection scanner for port 9862: task running
> 2023-03-22 13:18:13,536 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: got #-4
> 2023-03-22 13:18:13,536 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: Received ping message
> {noformat}
> RPCs are backlogged in the callQueue:
> !om_rpc_callqueue_ accumulation.png!
>