George Jahad created HDDS-8952:
----------------------------------

             Summary: MiniOzoneHAClusterImpl frequently hangs when writing keys to bucket.
                 Key: HDDS-8952
                 URL: https://issues.apache.org/jira/browse/HDDS-8952
             Project: Apache Ozone
          Issue Type: Sub-task
            Reporter: George Jahad


MiniOzoneHAClusterImpl frequently hangs when writing keys to bucket.

To show the problem, we have a simple test that just writes 10,000 keys.  It 
regularly hangs or times out when run against a MiniOzoneHAClusterImpl cluster.

This is a problem because MiniOzoneHAClusterImpl is critical for testing the 
snapshot bootstrap code.  We believe this hang is the root cause of HDDS-8876: 
https://issues.apache.org/jira/browse/HDDS-8876

This simple test shows the problem: 
https://github.com/GeorgeJahad/ozone/compare/153659032b..testSimple

In my experience, it hangs 20-30% of the time when run repeatedly in IntelliJ.


When it does hang, the client thread is in the process of writing/committing 
the key, while the server-side translator thread is waiting on a future.  I 
believe that future is waiting on a response from the active follower.  I've 
included stack traces for both threads below, taken while the test was hung.
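
For readers unfamiliar with this thread state, here is a minimal JDK-only 
sketch of a thread blocked in CompletableFuture.get() with no timeout, which 
is the shape of the server-side stack trace in this report.  It is not Ozone 
code: the class name, method names and thread name are hypothetical, and the 
never-completed future only stands in for the reply we think is missing.

```java
import java.util.concurrent.CompletableFuture;

public class ParkedOnFutureSketch {

    /** Starts a daemon thread that blocks in get() on the given future. */
    static Thread startWaiter(CompletableFuture<String> reply) {
        Thread t = new Thread(() -> {
            try {
                reply.get(); // no timeout: blocks until someone completes the future
            } catch (Exception e) {
                // Interrupted or completed exceptionally; ignored in this sketch.
            }
        }, "hypothetical-handler");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Polls until the thread reports WAITING or the deadline passes. */
    static boolean reachesWaiting(Thread t, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (t.getState() == Thread.State.WAITING) {
                return true;
            }
            Thread.sleep(10);
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> neverCompleted = new CompletableFuture<>();
        Thread waiter = startWaiter(neverCompleted);
        System.out.println("parked: " + reachesWaiting(waiter, 2000));

        // Completing the future is the only thing that releases the waiter.
        neverCompleted.complete("late reply");
        waiter.join(2000);
        System.out.println("alive after completion: " + waiter.isAlive());
    }
}
```

If nothing ever completes the future, the waiter stays in WAITING 
indefinitely, which matches what we see in the hung test.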

Occasionally when it hangs, we see the client thread in exactly the same place 
but there is no corresponding server-side translator thread.  Our guess is 
that in this case the server-side thread is killed by an unhandled exception, 
and the client-side thread doesn't notice and just waits forever instead of 
retrying.
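
To make "waits forever instead of retrying" concrete, here is a minimal 
JDK-only sketch of the alternative: a bounded wait on the reply future, with 
a resubmit on timeout.  It does not use Hadoop's RPC or retry classes.  The 
class name, the callWithRetry helper and its parameters are all hypothetical, 
and it assumes the request is safe to resubmit, which may not hold for a real 
key commit.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class BoundedWaitSketch {

    /**
     * Submits a request and waits at most timeoutMillis for the reply.
     * On timeout the request is resubmitted, up to maxAttempts in total.
     * Assumption: resubmitting the request is idempotent.
     */
    static <T> T callWithRetry(Supplier<CompletableFuture<T>> submit,
                               int maxAttempts, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        for (int attempt = 1; ; attempt++) {
            CompletableFuture<T> reply = submit.get();
            try {
                // Bounded wait: a lost reply surfaces as a TimeoutException.
                return reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                reply.cancel(true);
                if (attempt >= maxAttempts) {
                    throw e; // give up and report the hang to the caller
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger attempts = new AtomicInteger();
        // First attempt simulates a reply that never arrives; second succeeds.
        Supplier<CompletableFuture<String>> submit = () ->
                attempts.incrementAndGet() == 1
                        ? new CompletableFuture<String>()
                        : CompletableFuture.completedFuture("committed");

        String result = callWithRetry(submit, 3, 100);
        System.out.println(result + " after " + attempts.get() + " attempts");
        // prints "committed after 2 attempts"
    }
}
```

Even without the retry, a timeout alone would turn the silent hang into a 
visible test failure with a stack trace.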

Here are the stack traces for the two threads.

{{Client thread:}}
{{"main@1" prio=5 tid=0x1 nid=NA waiting}}
{{  java.lang.Thread.State: WAITING}}
{{      at java.lang.Object.wait(Object.java:-1)}}
{{      at java.lang.Object.wait(Object.java:502)}}
{{      at org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:65)}}
{{      at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1572)}}
{{      at org.apache.hadoop.ipc.Client.call(Client.java:1530)}}
{{      at org.apache.hadoop.ipc.Client.call(Client.java:1427)}}
{{      at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)}}
{{      at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)}}
{{      at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)}}
{{      at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source:-1)}}
{{      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}}
{{      at java.lang.reflect.Method.invoke(Method.java:498)}}
{{      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)}}
{{      at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)}}
{{      at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)}}
{{      at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)}}
{{      - locked <0x208df> (a org.apache.hadoop.io.retry.RetryInvocationHandler$Call)}}
{{      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)}}
{{      at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)}}
{{      at org.apache.hadoop.ozone.om.protocolPB.Hadoop3OmTransport.submitRequest(Hadoop3OmTransport.java:80)}}
{{      at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:304)}}
{{      at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.updateKey(OzoneManagerProtocolClientSideTranslatorPB.java:802)}}
{{      at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:760)}}
{{      at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:341)}}
{{      at org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:557)}}
{{      - locked <0x208e0> (a org.apache.hadoop.ozone.client.io.KeyOutputStream)}}
{{      at org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:86)}}
{{      - locked <0x208e1> (a org.apache.hadoop.ozone.client.io.OzoneOutputStream)}}
{{      at org.apache.hadoop.ozone.om.TestOzoneManagerHA.createKey(TestOzoneManagerHA.java:247)}}
{{      at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.writeKeys(TestOMRatisSnapshots.java:1085)}}

 

 

{{Server thread:}}

{{"IPC Server handler 5 on default port 15133@101855" daemon prio=5 tid=0x159ab nid=NA waiting}}
{{  java.lang.Thread.State: WAITING}}
{{      at sun.misc.Unsafe.park(Unsafe.java:-1)}}
{{      at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)}}
{{      at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)}}
{{      at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)}}
{{      at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)}}
{{      at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)}}
{{      at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:293)}}
{{      at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:250)}}
{{      at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:215)}}
{{      at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:200)}}
{{      at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$1300.1755468564.apply(Unknown Source:-1)}}
{{      at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)}}
{{      at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:142)}}
{{      at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java:-1)}}


