[ 
https://issues.apache.org/jira/browse/HDDS-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719197#comment-17719197
 ] 

Sumit Agrawal commented on HDDS-8366:
-------------------------------------

IMO, a single stack dump is not enough to confirm whether this is the same 
issue. Possible causes of the hang:
 # Lock issue (most probable cause): ApplyTransaction is waiting on a lock. 
The HDDS-7925 issue is present, which can cause this.
 # Operations are very slow: the probability is low, since it works fine 
after a restart.
 ## The system is heavily loaded, or very low on disk space or memory, which 
slows it down, so submit requests to Ratis pile up.
 ## The network is very slow, so peer node replies or request handling take a 
long time, or a peer is slow.
 # Locking is non-fair: currently, if reads on the same bucket are heavy, the 
write lock can be starved. But this is not observed in the stack: there are 
not many reads waiting or in progress.
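
To illustrate the non-fair locking point, here is a minimal sketch using the JDK's ReentrantReadWriteLock. It is not the OM lock implementation; the class BucketLockSketch and its methods are hypothetical names for illustration only. It shows how the fairness flag is chosen and how a writer can use a bounded wait instead of parking indefinitely while a read lock is held.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical per-bucket lock wrapper; not taken from the Ozone code base. */
public class BucketLockSketch {
  private final ReentrantReadWriteLock lock;

  /** fair=true makes waiting writers queue ahead of newly arriving readers. */
  public BucketLockSketch(boolean fair) {
    this.lock = new ReentrantReadWriteLock(fair);
  }

  public boolean isFair() {
    return lock.isFair();
  }

  public void acquireRead() {
    lock.readLock().lock();
  }

  public void releaseRead() {
    lock.readLock().unlock();
  }

  /**
   * Runs the action under the write lock, waiting at most timeoutMs.
   * Returns false if the write lock could not be acquired in time.
   */
  public boolean tryWrite(long timeoutMs, Runnable action) {
    try {
      if (!lock.writeLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
        return false; // readers (or another writer) still hold the lock
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      return false;
    }
    try {
      action.run();
      return true;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```

With a non-fair lock under a continuous stream of readers, a writer may wait for a long time; a fair lock trades some throughput for bounded waiting. A timed tryLock at least turns a silent hang into a visible failure.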

 

Based on the above probabilities, I see the lock issue as the likely reason, 
but I still cannot confirm it from the single stack available.

To prove it, one of the following is required: an error log showing a get 
bucket failure while a write operation for the same bucket is coming in

OR

multiple thread stack dumps showing threads stuck at the same place.
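
As a rough sketch of checking whether many threads are stuck at the same place, the snippet below counts live threads whose stack contains a given frame, using the standard ThreadMXBean API. It only inspects the JVM it runs in, so for a running OM the usual route is still repeated jstack dumps; the class StuckThreadCounter and the method countThreadsIn are hypothetical names.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper; counts threads currently inside a given method. */
public class StuckThreadCounter {

  /** Returns how many live threads have className.methodName on their stack. */
  public static int countThreadsIn(String className, String methodName) {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    int count = 0;
    // false, false: skip monitor and synchronizer details, stacks are enough
    for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
      if (info == null) {
        continue; // thread ended while the dump was being taken
      }
      for (StackTraceElement frame : info.getStackTrace()) {
        if (frame.getClassName().equals(className)
            && frame.getMethodName().equals(methodName)) {
          count++;
          break; // count each thread at most once
        }
      }
    }
    return count;
  }
}
```

Sampling such a count several times, a few seconds apart, for a frame like OzoneManagerRatisServer.submitRequestToRatis would show whether the same handlers stay parked there or are making progress.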

 

> OzoneManager hangs when submitRequestToRatis
> --------------------------------------------
>
>                 Key: HDDS-8366
>                 URL: https://issues.apache.org/jira/browse/HDDS-8366
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM, Ozone Manager
>    Affects Versions: 1.3.0
>            Reporter: Hongbing Wang
>            Assignee: Sumit Agrawal
>            Priority: Critical
>         Attachments: om.abnormal.jstack, om.normal.jstack, om_rpc_callqueue_ 
> accumulation.png
>
>
> All OM RPC handlers hang when calling 
> `OzoneManagerRatisServer#submitRequestToRatis`. The key stack is as follows:
> {noformat}
> "IPC Server handler 99 on 9862" #187 daemon prio=5 os_prio=0 tid=0x00007f1897b4c000 nid=0x10fa63 waiting on condition [0x00007f05a5b48000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x00007f08a185e050> (a java.util.concurrent.CompletableFuture$Signaller)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>       at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
>       at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>       at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
>       at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>       at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:285)
>       at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:247)
>       at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:217)
>       at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:198)
>       at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$696/251832800.apply(Unknown Source)
>       at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>       at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:147)
>       at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2716)
>    Locked ownable synchronizers:
>       - None
> {noformat}
> For the complete abnormal stack, see [^om.abnormal.jstack] (also see 
> [web link|https://github.com/whbing/issue_logs/blob/main/ozone/omrpc20230323/om.abnormal.jstack]).
> For comparison, the normal stack is in [^om.normal.jstack] (also see 
> [web link|https://github.com/whbing/issue_logs/blob/main/ozone/omrpc20230323/om.normal.jstack]).
> The IPC debug log is as follows:
> {noformat}
> 2023-03-22 13:17:56,135 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: Successfully authorized userInfo {
>   effectiveUser: "xxx"
> }
> protocol: "org.apache.hadoop.hdds.protocol.GenericRefreshProtocol"
> 2023-03-22 13:17:56,135 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server:  got #0
> 2023-03-22 13:17:57,143 [IPC Server idle connection scanner for port 9862] DEBUG org.apache.hadoop.ipc.Server: IPC Server idle connection scanner for port 9862: task running
> 2023-03-22 13:17:57,946 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server:  got #-4
> 2023-03-22 13:17:57,946 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: Received ping message
> 2023-03-22 13:18:07,143 [IPC Server idle connection scanner for port 9862] DEBUG org.apache.hadoop.ipc.Server: IPC Server idle connection scanner for port 9862: task running
> 2023-03-22 13:18:13,536 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server:  got #-4
> 2023-03-22 13:18:13,536 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: Received ping message
> {noformat}
> RPCs are backlogged in callQueue: 
>  !om_rpc_callqueue_ accumulation.png! 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

