[
https://issues.apache.org/jira/browse/HDDS-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738518#comment-17738518
]
Attila Doroszlai edited comment on HDDS-8952 at 6/29/23 8:41 PM:
-----------------------------------------------------------------
[~georgeJahad], I think this duplicates HDDS-8880. I've linked thread dumps
from a few failed test runs there; they are similar to what you posted.
Thanks for pointing out the problem with simple key creation. I've tried that
(locally) in {{TestOMRatisSnapshots}} and in a similar test class with Ozone's
default config. The problem only happens with the specific configurations set
in {{TestOMRatisSnapshots}}. Maybe there is a bug in how OM uses Ratis, or
in Ratis itself, but maybe the configs of the test are simply wrong.
default config:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/5415806490
snapshot config:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/5415810459
> MiniOzoneHAClusterImpl frequently hangs when writing keys to bucket.
> --------------------------------------------------------------------
>
> Key: HDDS-8952
> URL: https://issues.apache.org/jira/browse/HDDS-8952
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: George Jahad
> Priority: Blocker
>
> MiniOzoneHAClusterImpl frequently hangs when writing keys to bucket.
> To show the problem, we have a simple test that just writes 10,000 keys. It
> regularly hangs/times out when running on the MiniOzoneHAClusterImpl cluster.
>
> This is a problem because the MiniOzoneHAClusterImpl is critical for testing
> the snapshot bootstrap code. We believe it is the root cause of this issue:
> https://issues.apache.org/jira/browse/HDDS-8876
> This simple test shows the problem:
> [https://github.com/GeorgeJahad/ozone/compare/153659032b..testSimple]
> In my experience it hangs 20-30% of the time when running on repeat in
> IntelliJ.
> When it does hang, the client thread is in the process of writing/committing
> the key, while the server-side translator thread is waiting on a future. I
> believe that future is waiting on a Ratis response from the active follower.
> I've included stack traces for both threads below, taken while the test was hung.
> Occasionally when it hangs we see the client thread in exactly the same place
> but there is no corresponding server side translator thread. We are guessing
> in this case the server side thread gets killed by an unhandled exception,
> but the client side thread doesn't notice and just waits forever instead of
> retrying.
> Here are the stack traces for the two threads.
> {{Client thread:}}
> {{"main@1" prio=5 tid=0x1 nid=NA waiting}}
> {{ java.lang.Thread.State: WAITING}}
> {{ at java.lang.Object.wait(Object.java:-1)}}
> {{ at java.lang.Object.wait(Object.java:502)}}
> {{ at org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:65)}}
> {{ at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1572)}}
> {{ at org.apache.hadoop.ipc.Client.call(Client.java:1530)}}
> {{ at org.apache.hadoop.ipc.Client.call(Client.java:1427)}}
> {{ at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)}}
> {{ at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)}}
> {{ at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)}}
> {{ at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source:-1)}}
> {{ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}}
> {{ at java.lang.reflect.Method.invoke(Method.java:498)}}
> {{ at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)}}
> {{ at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)}}
> {{ at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)}}
> {{ at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)}}
> {{ - locked <0x208df> (a org.apache.hadoop.io.retry.RetryInvocationHandler$Call)}}
> {{ at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)}}
> {{ at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)}}
> {{ at org.apache.hadoop.ozone.om.protocolPB.Hadoop3OmTransport.submitRequest(Hadoop3OmTransport.java:80)}}
> {{ at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:304)}}
> {{ at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.updateKey(OzoneManagerProtocolClientSideTranslatorPB.java:802)}}
> {{ at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:760)}}
> {{ at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:341)}}
> {{ at org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:557)}}
> {{ - locked <0x208e0> (a org.apache.hadoop.ozone.client.io.KeyOutputStream)}}
> {{ at org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:86)}}
> {{ - locked <0x208e1> (a org.apache.hadoop.ozone.client.io.OzoneOutputStream)}}
> {{ at org.apache.hadoop.ozone.om.TestOzoneManagerHA.createKey(TestOzoneManagerHA.java:247)}}
> {{ at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.writeKeys(TestOMRatisSnapshots.java:1085)}}
>
>
> {{Server thread:}}
> {{"IPC Server handler 5 on default port 15133@101855" daemon prio=5 tid=0x159ab nid=NA waiting}}
> {{ java.lang.Thread.State: WAITING}}
> {{ at sun.misc.Unsafe.park(Unsafe.java:-1)}}
> {{ at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)}}
> {{ at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)}}
> {{ at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)}}
> {{ at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)}}
> {{ at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)}}
> {{ at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:293)}}
> {{ at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:250)}}
> {{ at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:215)}}
> {{ at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:200)}}
> {{ at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$1300.1755468564.apply(Unknown Source:-1)}}
> {{ at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)}}
> {{ at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:142)}}
> {{ at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java:-1)}}
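
For illustration only: the server trace above is parked in {{CompletableFuture.get()}} with no timeout, so a reply that never arrives blocks the handler thread indefinitely. The sketch below shows the general difference between that and a bounded wait using the standard JDK API. It is not Ozone or Ratis code and not a proposed fix for this issue; the class name {{BoundedWaitSketch}}, the helper {{awaitReply}}, and the "TIMED_OUT" marker are all hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; none of these names exist in Ozone or Ratis.
class BoundedWaitSketch {

  /**
   * Waits for a reply, but gives up after timeoutMs instead of parking
   * forever the way an untimed reply.get() would.
   */
  static String awaitReply(CompletableFuture<String> reply, long timeoutMs)
      throws InterruptedException, ExecutionException {
    try {
      return reply.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // A real caller could retry or fail the request here; the sketch
      // just reports that the wait was abandoned.
      return "TIMED_OUT";
    }
  }

  public static void main(String[] args) throws Exception {
    // A future that nobody completes stands in for a lost reply.
    CompletableFuture<String> lostReply = new CompletableFuture<>();
    System.out.println(awaitReply(lostReply, 100));

    // An already completed future returns its value immediately.
    CompletableFuture<String> okReply =
        CompletableFuture.completedFuture("OK");
    System.out.println(awaitReply(okReply, 100));
  }
}
```

With a timed wait, a hung request would surface as an error that can be logged or retried, rather than as a test that times out with no indication of which side stopped responding.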