[
https://issues.apache.org/jira/browse/HDDS-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739376#comment-17739376
]
Attila Doroszlai edited comment on HDDS-8952 at 7/2/23 1:20 PM:
----------------------------------------------------------------
[~georgeJahad], sorry for disabling the test without notification. I should
have tagged you in HDDS-8880. Fork timeout (after 20 minutes) happened in 8
out of 17 runs after your commit, not to mention runs in PRs and forks. First
I marked it as flaky, but it didn't help, due to the nature of the failure
(fork timeout, instead of test failure/timeout). See commit history:
https://github.com/apache/ozone/commits/master?after=43d8e667c375bb9fe8829ba50b86f7a87e3edf5f+0
Here's my latest attempt to reduce the number of test cases disabled:
https://github.com/apache/ozone/compare/master...adoroszlai:hadoop-ozone:HDDS-8880-tweak
In this patch, only {{testInstallSnapshot}} is left disabled.
{{testInstallIncrementalSnapshotWithFailure}} is marked as flaky. Mini cluster
shutdown timed out in 2/100 runs, but otherwise it seems OK.
Please feel free to make any changes to {{TestOMRatisSnapshots}}, taking my
patch as a starting point if you wish.
> MiniOzoneHAClusterImpl frequently hangs when writing keys to bucket.
> --------------------------------------------------------------------
>
> Key: HDDS-8952
> URL: https://issues.apache.org/jira/browse/HDDS-8952
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: George Jahad
> Priority: Blocker
>
> MiniOzoneHAClusterImpl frequently hangs when writing keys to bucket.
> To show the problem, we have a simple test that just writes 10,000 keys. It
> regularly hangs/times out when running on the MiniOzoneHAClusterImpl cluster.
>
> This is a problem because the MiniOzoneHAClusterImpl is critical for testing
> the snapshot bootstrap code. We believe it is the root cause of this issue:
> https://issues.apache.org/jira/browse/HDDS-8876
> This simple test shows the problem:
> [https://github.com/GeorgeJahad/ozone/compare/153659032b..testSimple]
> In my experience it hangs 20-30% of the time when run repeatedly in
> IntelliJ.
> When it does hang, the client thread is in the process of writing/committing
> the key, while the server-side translator thread is waiting on a future. I
> believe that future is waiting on a Ratis response from the active follower.
> I've included stack traces for both threads below, taken while the test was hung.
> Occasionally when it hangs we see the client thread in exactly the same place,
> but there is no corresponding server-side translator thread. We are guessing
> that in this case the server-side thread gets killed by an unhandled exception,
> but the client-side thread doesn't notice and just waits forever instead of
> retrying.
> Here are the stack traces for the two threads.
> {{Client thread:}}
> {{"main@1" prio=5 tid=0x1 nid=NA waiting}}
> {{ java.lang.Thread.State: WAITING}}
> {{ at java.lang.Object.wait(Object.java:-1)}}
> {{ at java.lang.Object.wait(Object.java:502)}}
> {{ at org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:65)}}
> {{ at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1572)}}
> {{ at org.apache.hadoop.ipc.Client.call(Client.java:1530)}}
> {{ at org.apache.hadoop.ipc.Client.call(Client.java:1427)}}
> {{ at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)}}
> {{ at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)}}
> {{ at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)}}
> {{ at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source:-1)}}
> {{ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}}
> {{ at java.lang.reflect.Method.invoke(Method.java:498)}}
> {{ at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)}}
> {{ at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)}}
> {{ at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)}}
> {{ at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)}}
> {{ - locked <0x208df> (a org.apache.hadoop.io.retry.RetryInvocationHandler$Call)}}
> {{ at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)}}
> {{ at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)}}
> {{ at org.apache.hadoop.ozone.om.protocolPB.Hadoop3OmTransport.submitRequest(Hadoop3OmTransport.java:80)}}
> {{ at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:304)}}
> {{ at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.updateKey(OzoneManagerProtocolClientSideTranslatorPB.java:802)}}
> {{ at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:760)}}
> {{ at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:341)}}
> {{ at org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:557)}}
> {{ - locked <0x208e0> (a org.apache.hadoop.ozone.client.io.KeyOutputStream)}}
> {{ at org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:86)}}
> {{ - locked <0x208e1> (a org.apache.hadoop.ozone.client.io.OzoneOutputStream)}}
> {{ at org.apache.hadoop.ozone.om.TestOzoneManagerHA.createKey(TestOzoneManagerHA.java:247)}}
> {{ at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.writeKeys(TestOMRatisSnapshots.java:1085)}}
>
>
> {{Server thread:}}
> {{"IPC Server handler 5 on default port 15133@101855" daemon prio=5 tid=0x159ab nid=NA waiting}}
> {{ java.lang.Thread.State: WAITING}}
> {{ at sun.misc.Unsafe.park(Unsafe.java:-1)}}
> {{ at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)}}
> {{ at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)}}
> {{ at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)}}
> {{ at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)}}
> {{ at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)}}
> {{ at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:293)}}
> {{ at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:250)}}
> {{ at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:215)}}
> {{ at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:200)}}
> {{ at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$1300.1755468564.apply(Unknown Source:-1)}}
> {{ at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)}}
> {{ at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:142)}}
> {{ at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java:-1)}}
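
The server-side trace above is parked in an unbounded CompletableFuture.get() called from submitRequestToRatis. As a minimal sketch of the general alternative (not the Ozone implementation, and not a proposed fix for this issue), a wait can be bounded so that a missing reply surfaces as a timeout the caller can act on, rather than a hang. The class BoundedWaitSketch, the method awaitReply, and the TIMED_OUT marker are hypothetical names used only for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names exist in Ozone.
public class BoundedWaitSketch {

  /**
   * Waits for a reply for at most timeoutMs milliseconds.
   * Returns the reply, or the marker "TIMED_OUT" if none arrived in time.
   */
  static String awaitReply(CompletableFuture<String> reply, long timeoutMs)
      throws InterruptedException, ExecutionException {
    try {
      // get(timeout, unit) throws TimeoutException instead of parking forever.
      return reply.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // The caller can now retry, fail the request, or log diagnostics.
      return "TIMED_OUT";
    }
  }

  public static void main(String[] args) throws Exception {
    // A future that nobody completes, standing in for a lost reply.
    CompletableFuture<String> lostReply = new CompletableFuture<>();
    System.out.println(awaitReply(lostReply, 100));

    // A future that is already complete returns immediately.
    CompletableFuture<String> done = CompletableFuture.completedFuture("OK");
    System.out.println(awaitReply(done, 100));
  }
}
```

Whether a bounded wait is appropriate at this call site depends on the retry semantics of the request, which the description above does not settle.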