[ 
https://issues.apache.org/jira/browse/HDDS-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739376#comment-17739376
 ] 

Attila Doroszlai edited comment on HDDS-8952 at 7/2/23 1:20 PM:
----------------------------------------------------------------

[~georgeJahad], sorry for disabling the test without notification.  I should 
have tagged you in HDDS-8880.  Fork timeout (after 20 minutes) happened in 8 
out of 17 runs after your commit, not to mention runs in PRs and forks.  At 
first I marked it as flaky, but that didn't help, because the failure is a 
fork timeout rather than a test failure or test timeout.  See commit history: 
https://github.com/apache/ozone/commits/master?after=43d8e667c375bb9fe8829ba50b86f7a87e3edf5f+0

Here's my latest attempt to reduce the number of test cases disabled:
https://github.com/apache/ozone/compare/master...adoroszlai:hadoop-ozone:HDDS-8880-tweak

In this patch, only {{testInstallSnapshot}} is left disabled.  
{{testInstallIncrementalSnapshotWithFailure}} is marked as flaky.  Mini cluster 
shutdown timed out in 2/100 runs, but otherwise it seems OK.

Please feel free to make any changes to {{TestOMRatisSnapshots}}, taking my 
patch as a starting point if you wish.



> MiniOzoneHAClusterImpl frequently hangs when writing keys to bucket.
> --------------------------------------------------------------------
>
>                 Key: HDDS-8952
>                 URL: https://issues.apache.org/jira/browse/HDDS-8952
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: George Jahad
>            Priority: Blocker
>
> MiniOzoneHAClusterImpl frequently hangs when writing keys to bucket.
> To show the problem, we have a simple test that just writes 10,000 keys.  It 
> regularly hangs/times out when running on the MiniOzoneHAClusterImpl cluster. 
>  
> This is a problem because the MiniOzoneHAClusterImpl is critical for testing 
> the snapshot bootstrap code.  We believe it is the root cause of this issue: 
> https://issues.apache.org/jira/browse/HDDS-8876
> This simple test shows the problem: 
> [https://github.com/GeorgeJahad/ozone/compare/153659032b..testSimple]
> In my experience it hangs 20-30% of the time when running on repeat in 
> IntelliJ.
> When it does hang, the client thread is in the process of writing/committing 
> the key, while the server-side translator thread is waiting on a future.  I 
> believe that future is waiting on a Ratis response from the active follower.  
> I've included stack traces for both threads below, taken while the test was hung.
> Occasionally when it hangs we see the client thread in exactly the same place 
> but there is no corresponding server side translator thread.  We are guessing 
> in this case the server side thread gets killed by an unhandled exception, 
> but the client side thread doesn't notice and just waits forever instead of 
> retrying.
> Here are the stack traces for the two threads.
> {{Client thread:}}
> {{"main@1" prio=5 tid=0x1 nid=NA waiting}}
> {{  java.lang.Thread.State: WAITING}}
> {{      at java.lang.Object.wait(Object.java:-1)}}
> {{      at java.lang.Object.wait(Object.java:502)}}
> {{      at org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:65)}}
> {{      at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1572)}}
> {{      at org.apache.hadoop.ipc.Client.call(Client.java:1530)}}
> {{      at org.apache.hadoop.ipc.Client.call(Client.java:1427)}}
> {{      at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)}}
> {{      at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)}}
> {{      at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)}}
> {{      at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source:-1)}}
> {{      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}}
> {{      at java.lang.reflect.Method.invoke(Method.java:498)}}
> {{      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)}}
> {{      at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)}}
> {{      at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)}}
> {{      at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)}}
> {{      - locked <0x208df> (a org.apache.hadoop.io.retry.RetryInvocationHandler$Call)}}
> {{      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)}}
> {{      at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)}}
> {{      at org.apache.hadoop.ozone.om.protocolPB.Hadoop3OmTransport.submitRequest(Hadoop3OmTransport.java:80)}}
> {{      at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:304)}}
> {{      at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.updateKey(OzoneManagerProtocolClientSideTranslatorPB.java:802)}}
> {{      at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:760)}}
> {{      at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:341)}}
> {{      at org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:557)}}
> {{      - locked <0x208e0> (a org.apache.hadoop.ozone.client.io.KeyOutputStream)}}
> {{      at org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:86)}}
> {{      - locked <0x208e1> (a org.apache.hadoop.ozone.client.io.OzoneOutputStream)}}
> {{      at org.apache.hadoop.ozone.om.TestOzoneManagerHA.createKey(TestOzoneManagerHA.java:247)}}
> {{      at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.writeKeys(TestOMRatisSnapshots.java:1085)}}
>  
>  
> {{Server thread:}}
> {{"IPC Server handler 5 on default port 15133@101855" daemon prio=5 tid=0x159ab nid=NA waiting}}
> {{  java.lang.Thread.State: WAITING}}
> {{      at sun.misc.Unsafe.park(Unsafe.java:-1)}}
> {{      at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)}}
> {{      at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)}}
> {{      at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)}}
> {{      at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)}}
> {{      at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)}}
> {{      at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:293)}}
> {{      at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:250)}}
> {{      at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:215)}}
> {{      at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:200)}}
> {{      at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$1300.1755468564.apply(Unknown Source:-1)}}
> {{      at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)}}
> {{      at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:142)}}
> {{      at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java:-1)}}
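
The server thread in the trace above is parked in an untimed {{CompletableFuture.get()}}, which never returns if the future is never completed. As a minimal illustration only (this is not Ozone code, and {{BoundedWaitSketch}} and {{awaitReply}} are hypothetical names), the sketch below shows how a bounded {{get(timeout, unit)}} turns such a silent hang into a visible timeout that a caller could log or retry on. Whether a timeout is the right remedy for this issue is an assumption, not something the report establishes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedWaitSketch {

  /**
   * Hypothetical helper: waits for a reply for at most timeoutMillis.
   * Returns the reply, or the marker "TIMED_OUT" if the future was not
   * completed in time (an untimed get() would block forever instead).
   */
  static String awaitReply(CompletableFuture<String> reply, long timeoutMillis)
      throws InterruptedException, ExecutionException {
    try {
      return reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // The caller can now log, retry, or fail the request.
      return "TIMED_OUT";
    }
  }

  public static void main(String[] args) throws Exception {
    // A future that nobody completes, standing in for a lost reply.
    CompletableFuture<String> neverCompleted = new CompletableFuture<String>();
    System.out.println(awaitReply(neverCompleted, 100));

    // A future that is already completed returns immediately.
    CompletableFuture<String> completed =
        CompletableFuture.completedFuture("OK");
    System.out.println(awaitReply(completed, 100));
  }
}
```

Run as written, the first call gives up after roughly 100 ms and the second returns at once, so a test using this pattern fails with a clear timeout rather than running into the fork timeout.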


