[ 
https://issues.apache.org/jira/browse/HDDS-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739341#comment-17739341
 ] 

George Jahad commented on HDDS-8952:
------------------------------------

[~adoroszlai] 

We are concerned that TestOMRatisSnapshots is disabled while we continue to 
make significant changes to the snapshotting codebase.  I'm afraid people will 
make breaking changes that go unnoticed because those tests are disabled.

We'd like to keep as much of the testing as possible while working around the 
flakiness.  Here is our current proposal:

The TestOMRatisSnapshots flakiness got much worse in commit 
9f6cb9de5596fc2228bb6baf58ead6b713036757, because that commit added a lot more 
writes to the test.  So we would like to restore that class to the previous 
revision: cc1d2b3984194d22027fc119c05329e1c9e73c5f

Then we will move the latest changes to a new class whose Ratis parameters are 
modified to eliminate the flakiness.

Two tests have changed since that revision: testInstallIncrementalSnapshot() 
and testInstallSnapshot().  So those two tests will exist in both classes: the 
old versions in the old class with the old Ratis parameters, and the current 
versions in the new class with the modified parameters.

That will keep the CI working for now until we figure out the right solution.  
Does that seem reasonable?

> MiniOzoneHAClusterImpl frequently hangs when writing keys to bucket.
> --------------------------------------------------------------------
>
>                 Key: HDDS-8952
>                 URL: https://issues.apache.org/jira/browse/HDDS-8952
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: George Jahad
>            Priority: Blocker
>
> MiniOzoneHAClusterImpl frequently hangs when writing keys to bucket.
> To show the problem, we have a simple test that just writes 10,000 keys.  It 
> regularly hangs/times out when running on the MiniOzoneHAClusterImpl cluster. 
>  
> This is a problem because the MiniOzoneHAClusterImpl is critical for testing 
> the snapshot bootstrap code.  We believe it is the root cause of this issue: 
> https://issues.apache.org/jira/browse/HDDS-8876
> This simple test shows the problem: 
> [https://github.com/GeorgeJahad/ozone/compare/153659032b..testSimple]
> In my experience it hangs 20-30% of the time when running on repeat in 
> IntelliJ.
> When it does hang, the client thread is in the process of writing/committing 
> the key, while the server side translator thread is waiting on a future.  I 
> believe that future is waiting on a Ratis response from the active follower.  
> I've included stack traces for both threads below, taken while the test is 
> hung.
> Occasionally when it hangs we see the client thread in exactly the same place 
> but there is no corresponding server side translator thread.  We are guessing 
> in this case the server side thread gets killed by an unhandled exception, 
> but the client side thread doesn't notice and just waits forever instead of 
> retrying.
> Here are the stack traces for the two threads.
> {{Client thread:}}
> {{"main@1" prio=5 tid=0x1 nid=NA waiting}}
> {{  java.lang.Thread.State: WAITING}}
> {{      at java.lang.Object.wait(Object.java:-1)}}
> {{      at java.lang.Object.wait(Object.java:502)}}
> {{      at 
> org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:65)}}
> {{      at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1572)}}
> {{      at org.apache.hadoop.ipc.Client.call(Client.java:1530)}}
> {{      at org.apache.hadoop.ipc.Client.call(Client.java:1427)}}
> {{      at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)}}
> {{      at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)}}
> {{      at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)}}
> {{      at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source:-1)}}
> {{      at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}}
> {{      at java.lang.reflect.Method.invoke(Method.java:498)}}
> {{      at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)}}
> {{      at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)}}
> {{      at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)}}
> {{      at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)}}
> {{      - locked <0x208df> (a 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call)}}
> {{      at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)}}
> {{      at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)}}
> {{      at 
> org.apache.hadoop.ozone.om.protocolPB.Hadoop3OmTransport.submitRequest(Hadoop3OmTransport.java:80)}}
> {{      at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:304)}}
> {{      at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.updateKey(OzoneManagerProtocolClientSideTranslatorPB.java:802)}}
> {{      at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:760)}}
> {{      at 
> org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:341)}}
> {{      at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:557)}}
> {{      - locked <0x208e0> (a 
> org.apache.hadoop.ozone.client.io.KeyOutputStream)}}
> {{      at 
> org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:86)}}
> {{      - locked <0x208e1> (a 
> org.apache.hadoop.ozone.client.io.OzoneOutputStream)}}
> {{      at 
> org.apache.hadoop.ozone.om.TestOzoneManagerHA.createKey(TestOzoneManagerHA.java:247)}}
> {{      at 
> org.apache.hadoop.ozone.om.TestOMRatisSnapshots.writeKeys(TestOMRatisSnapshots.java:1085)}}
>  
>  
> {{Server thread:}}
> {{"IPC Server handler 5 on default port 15133@101855" daemon prio=5 
> tid=0x159ab nid=NA waiting}}
> {{  java.lang.Thread.State: WAITING}}
> {{      at sun.misc.Unsafe.park(Unsafe.java:-1)}}
> {{      at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)}}
> {{      at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)}}
> {{      at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)}}
> {{      at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)}}
> {{      at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)}}
> {{      at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:293)}}
> {{      at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:250)}}
> {{      at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:215)}}
> {{      at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:200)}}
> {{      at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$1300.1755468564.apply(Unknown Source:-1)}}
> {{      at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)}}
> {{      at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:142)}}
> {{      at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java:-1)}}
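
For readers less familiar with the pattern in the server trace above: the 
handler thread is parked in CompletableFuture.get() with no timeout, so it 
waits for as long as the future stays incomplete.  The snippet below is only a 
minimal JDK-level sketch of the difference between an unbounded and a bounded 
wait.  It is not Ozone or Ratis code and not a proposed fix; the class name 
BoundedWaitSketch, the method awaitReply, and the 100 ms timeout are all 
hypothetical and chosen purely for illustration.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names exist in Ozone or Ratis.
class BoundedWaitSketch {

  // Waits at most timeoutMs for the reply.  Returns an empty Optional on
  // timeout so the caller can decide whether to retry or fail.
  static Optional<String> awaitReply(CompletableFuture<String> reply,
      long timeoutMs) throws InterruptedException, ExecutionException {
    try {
      return Optional.ofNullable(reply.get(timeoutMs, TimeUnit.MILLISECONDS));
    } catch (TimeoutException e) {
      return Optional.empty();
    }
  }

  public static void main(String[] args) throws Exception {
    // A future that nobody ever completes, standing in for a lost response.
    CompletableFuture<String> lost = new CompletableFuture<>();
    // lost.get() with no arguments would park this thread indefinitely,
    // which is the state shown in the server-side stack trace.
    System.out.println(awaitReply(lost, 100)); // prints "Optional.empty"

    // A future that is already complete returns immediately.
    CompletableFuture<String> done = CompletableFuture.completedFuture("ok");
    System.out.println(awaitReply(done, 100)); // prints "Optional[ok]"
  }
}
```

With a bounded wait the caller at least regains control and can log, retry, or 
fail the request; whether that is appropriate for the OM request path is a 
separate question that this sketch does not try to answer.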



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
