[ 
https://issues.apache.org/jira/browse/HDDS-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738518#comment-17738518
 ] 

Attila Doroszlai edited comment on HDDS-8952 at 6/29/23 8:41 PM:
-----------------------------------------------------------------

[~georgeJahad], I think this duplicates HDDS-8880. I've linked thread dumps 
from a few failed test runs there; they are similar to what you posted.

Thanks for pointing out the problem with simple key creation. I've tried that 
(locally) in {{TestOMRatisSnapshots}} and in a similar test class with Ozone's 
default config. The problem only happens with the specific configurations set 
in {{TestOMRatisSnapshots}}. Maybe there is a bug in how OM uses Ratis, or 
in Ratis itself, but maybe the configs of the test are simply wrong.

default config: 
https://github.com/adoroszlai/hadoop-ozone/actions/runs/5415806490
snapshot config: 
https://github.com/adoroszlai/hadoop-ozone/actions/runs/5415810459


> MiniOzoneHAClusterImpl frequently hangs when writing keys to bucket.
> --------------------------------------------------------------------
>
>                 Key: HDDS-8952
>                 URL: https://issues.apache.org/jira/browse/HDDS-8952
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: George Jahad
>            Priority: Blocker
>
> MiniOzoneHAClusterImpl frequently hangs when writing keys to bucket.
> To show the problem, we have a simple test that just writes 10,000 keys.  It 
> regularly hangs/times out when running on the MiniOzoneHAClusterImpl cluster. 
>  
> This is a problem because the MiniOzoneHAClusterImpl is critical for testing 
> the snapshot bootstrap code.  We believe it is the root cause of this issue: 
> https://issues.apache.org/jira/browse/HDDS-8876
> This simple test shows the problem: 
> [https://github.com/GeorgeJahad/ozone/compare/153659032b..testSimple]
> In my experience it hangs 20-30% of the time when run repeatedly in 
> IntelliJ.
> When it does hang, the client thread is in the process of writing/committing 
> the key, while the server-side translator thread is waiting on a future.  I 
> believe that future is waiting on a Ratis response from the active follower.  
> I've included stack traces for both threads below, taken while the test is hung.
> Occasionally when it hangs we see the client thread in exactly the same place 
> but there is no corresponding server-side translator thread.  We are guessing 
> that in this case the server-side thread gets killed by an unhandled exception, 
> but the client-side thread doesn't notice and just waits forever instead of 
> retrying.
> Here are the stack traces for the two threads.
> {{Client thread:}}
> {{"main@1" prio=5 tid=0x1 nid=NA waiting}}
> {{  java.lang.Thread.State: WAITING}}
> {{      at java.lang.Object.wait(Object.java:-1)}}
> {{      at java.lang.Object.wait(Object.java:502)}}
> {{      at 
> org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:65)}}
> {{      at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1572)}}
> {{      at org.apache.hadoop.ipc.Client.call(Client.java:1530)}}
> {{      at org.apache.hadoop.ipc.Client.call(Client.java:1427)}}
> {{      at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)}}
> {{      at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)}}
> {{      at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)}}
> {{      at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source:-1)}}
> {{      at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}}
> {{      at java.lang.reflect.Method.invoke(Method.java:498)}}
> {{      at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)}}
> {{      at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)}}
> {{      at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)}}
> {{      at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)}}
> {{      - locked <0x208df> (a 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call)}}
> {{      at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)}}
> {{      at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)}}
> {{      at 
> org.apache.hadoop.ozone.om.protocolPB.Hadoop3OmTransport.submitRequest(Hadoop3OmTransport.java:80)}}
> {{      at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:304)}}
> {{      at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.updateKey(OzoneManagerProtocolClientSideTranslatorPB.java:802)}}
> {{      at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:760)}}
> {{      at 
> org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:341)}}
> {{      at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:557)}}
> {{      - locked <0x208e0> (a 
> org.apache.hadoop.ozone.client.io.KeyOutputStream)}}
> {{      at 
> org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:86)}}
> {{      - locked <0x208e1> (a 
> org.apache.hadoop.ozone.client.io.OzoneOutputStream)}}
> {{      at 
> org.apache.hadoop.ozone.om.TestOzoneManagerHA.createKey(TestOzoneManagerHA.java:247)}}
> {{      at 
> org.apache.hadoop.ozone.om.TestOMRatisSnapshots.writeKeys(TestOMRatisSnapshots.java:1085)}}
>  
>  
> {{Server thread:}}
> {{"IPC Server handler 5 on default port 15133@101855" daemon prio=5 
> tid=0x159ab nid=NA waiting}}
> {{  java.lang.Thread.State: WAITING}}
> {{      at sun.misc.Unsafe.park(Unsafe.java:-1)}}
> {{      at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)}}
> {{      at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)}}
> {{      at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)}}
> {{      at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)}}
> {{      at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)}}
> {{      at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:293)}}
> {{      at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:250)}}
> {{      at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:215)}}
> {{      at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:200)}}
> {{      at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$1300.1755468564.apply(Unknown Source:-1)}}
> {{      at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)}}
> {{      at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:142)}}
> {{      at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java:-1)}}
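
The server thread in the trace above is parked in {{CompletableFuture.get()}} with no timeout, so it waits indefinitely if the future is never completed. The following is a minimal sketch using only the JDK, not Ozone or Ratis code, of how a bounded wait behaves differently from an unbounded one. The class {{BoundedWait}} and the method {{awaitReply}} are hypothetical names introduced for illustration, and this is not a proposed fix for the issue.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedWait {

  // Hypothetical helper (not part of Ozone): waits for a reply for at most
  // timeoutMs milliseconds instead of blocking indefinitely like get().
  static String awaitReply(CompletableFuture<String> reply, long timeoutMs)
      throws InterruptedException {
    try {
      return reply.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // The future was not completed in time; the caller could retry here.
      return "timed out";
    } catch (ExecutionException e) {
      // The future was completed exceptionally; surface the cause.
      return "failed: " + e.getCause().getMessage();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // A future that nobody ever completes: plain get() would hang on this.
    CompletableFuture<String> neverCompleted = new CompletableFuture<>();
    System.out.println(awaitReply(neverCompleted, 100));

    // A future that is already completed returns immediately.
    CompletableFuture<String> completed =
        CompletableFuture.completedFuture("ok");
    System.out.println(awaitReply(completed, 100));

    // A future completed with an exception reports the failure.
    CompletableFuture<String> failed = new CompletableFuture<>();
    failed.completeExceptionally(new IllegalStateException("boom"));
    System.out.println(awaitReply(failed, 100));
  }
}
```

With a bounded wait, a reply that never arrives surfaces as a timeout that the caller can log or retry, rather than as a thread that stays in {{WAITING}} forever.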


