[
https://issues.apache.org/jira/browse/HDDS-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739341#comment-17739341
]
George Jahad commented on HDDS-8952:
------------------------------------
[~adoroszlai]
We are concerned that TestOMRatisSnapshots is disabled while we continue to
make significant changes to the snapshotting codebase. I'm afraid people will
make breaking changes that go unnoticed because those tests are disabled.
We'd like to keep as much of the test coverage as possible while working
around the flakiness. Here is our current proposal:
The TestOMRatisSnapshots flakiness got much worse with this commit:
9f6cb9de5596fc2228bb6baf58ead6b713036757 because it added a lot more writes to
the test. So we would like to restore that class to the previous revision:
cc1d2b3984194d22027fc119c05329e1c9e73c5f
Then we will move the latest changes to a new class that has the Ratis
parameters modified to eliminate the flakiness.
Two tests have changed since that revision: testInstallIncrementalSnapshot()
and testInstallSnapshot(). Those two tests will therefore exist in both
classes: the old versions in the old class with the old Ratis parameters, and
the current versions in the new class with the modified parameters.
That will keep the CI working for now until we figure out the right solution.
Does that seem reasonable?
> MiniOzoneHAClusterImpl frequently hangs when writing keys to bucket.
> --------------------------------------------------------------------
>
> Key: HDDS-8952
> URL: https://issues.apache.org/jira/browse/HDDS-8952
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: George Jahad
> Priority: Blocker
>
> MiniOzoneHAClusterImpl frequently hangs when writing keys to bucket.
> To show the problem, we have a simple test that just writes 10,000 keys. It
> regularly hangs/times out when running on the MiniOzoneHAClusterImpl cluster.
>
> This is a problem because the MiniOzoneHAClusterImpl is critical for testing
> the snapshot bootstrap code. We believe it is the root cause of this issue:
> https://issues.apache.org/jira/browse/HDDS-8876
> This simple test shows the problem:
> [https://github.com/GeorgeJahad/ozone/compare/153659032b..testSimple]
> In my experience it hangs 20-30% of the time when running on repeat in
> IntelliJ.
> When it does hang, the client thread is in the process of writing/committing
> the key, while the server-side translator thread is waiting on a future. I
> believe that future is waiting on a Ratis response from the active follower.
> I've included stack traces for both threads below, taken while the test was
> hung.
> Occasionally when it hangs, we see the client thread in exactly the same
> place but there is no corresponding server-side translator thread. We are
> guessing that in this case the server-side thread gets killed by an unhandled
> exception, but the client-side thread doesn't notice and just waits forever
> instead of retrying.
> Here are the stack traces for the two threads.
> {{Client thread:}}
> {{"main@1" prio=5 tid=0x1 nid=NA waiting}}
> {{ java.lang.Thread.State: WAITING}}
> {{ at java.lang.Object.wait(Object.java:-1)}}
> {{ at java.lang.Object.wait(Object.java:502)}}
> {{ at
> org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:65)}}
> {{ at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1572)}}
> {{ at org.apache.hadoop.ipc.Client.call(Client.java:1530)}}
> {{ at org.apache.hadoop.ipc.Client.call(Client.java:1427)}}
> {{ at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)}}
> {{ at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)}}
> {{ at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)}}
> {{ at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source:-1)}}
> {{ at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}}
> {{ at java.lang.reflect.Method.invoke(Method.java:498)}}
> {{ at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)}}
> {{ at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)}}
> {{ at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)}}
> {{ at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)}}
> {{ - locked <0x208df> (a
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call)}}
> {{ at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)}}
> {{ at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)}}
> {{ at
> org.apache.hadoop.ozone.om.protocolPB.Hadoop3OmTransport.submitRequest(Hadoop3OmTransport.java:80)}}
> {{ at
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:304)}}
> {{ at
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.updateKey(OzoneManagerProtocolClientSideTranslatorPB.java:802)}}
> {{ at
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:760)}}
> {{ at
> org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:341)}}
> {{ at
> org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:557)}}
> {{ - locked <0x208e0> (a
> org.apache.hadoop.ozone.client.io.KeyOutputStream)}}
> {{ at
> org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:86)}}
> {{ - locked <0x208e1> (a
> org.apache.hadoop.ozone.client.io.OzoneOutputStream)}}
> {{ at
> org.apache.hadoop.ozone.om.TestOzoneManagerHA.createKey(TestOzoneManagerHA.java:247)}}
> {{ at
> org.apache.hadoop.ozone.om.TestOMRatisSnapshots.writeKeys(TestOMRatisSnapshots.java:1085)}}
>
>
> {{Server thread:}}
> {{"IPC Server handler 5 on default port 15133@101855" daemon prio=5
> tid=0x159ab nid=NA waiting}}
> {{ java.lang.Thread.State: WAITING}}
> {{ at sun.misc.Unsafe.park(Unsafe.java:-1)}}
> {{ at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)}}
> {{ at
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)}}
> {{ at
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)}}
> {{ at
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)}}
> {{ at
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)}}
> {{ at
> org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:293)}}
> {{ at
> org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:250)}}
> {{ at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:215)}}
> {{ at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:200)}}
> {{ at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$1300.1755468564.apply(Unknown
> Source:-1)}}
> {{ at
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)}}
> {{ at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:142)}}
> {{ at
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java:-1)}}
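
As an illustration of the "waits forever instead of retrying" behavior described above: the server thread in the trace is parked in an untimed CompletableFuture.get(). The sketch below shows, in plain Java, what a bounded wait with a limited number of retries could look like. It is only a sketch under assumptions: BoundedWait and getWithRetry are hypothetical names, not part of Ozone or Ratis, and nothing here is meant as the actual fix for this issue.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper for illustration only; not an Ozone or Ratis class.
public class BoundedWait {

    /**
     * Issues the request, waits at most timeoutMs for its future, and
     * re-issues it up to maxAttempts times before giving up.
     */
    static <T> T getWithRetry(Supplier<CompletableFuture<T>> request,
                              long timeoutMs, int maxAttempts)
            throws InterruptedException, ExecutionException, TimeoutException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        TimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            CompletableFuture<T> future = request.get();
            try {
                // Timed get(): throws TimeoutException instead of parking forever.
                return future.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // abandon this attempt before retrying
                last = e;
            }
        }
        throw last; // every attempt timed out
    }

    public static void main(String[] args) throws Exception {
        try {
            // A future that never completes stands in for the hung request.
            BoundedWait.<String>getWithRetry(CompletableFuture::new, 100, 2);
        } catch (TimeoutException e) {
            System.out.println("request did not complete; failing instead of hanging");
        }
    }
}
```

With a bound like this, a hung request would show up as a test failure with a stack trace rather than as a CI timeout. Whether retrying is safe depends on the request being idempotent, which this sketch does not address.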