[ 
https://issues.apache.org/jira/browse/HDDS-3116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056781#comment-17056781
 ] 

Stephen O'Donnell commented on HDDS-3116:
-----------------------------------------

I could reproduce this problem quite easily in docker before adding the 
synchronized fix, and afterwards I have not been able to reproduce it using the 
same technique.

I think long term this code should be refactored, but the synchronized blocks 
seem to address the immediate issue. It would be easy for another change to 
introduce more issues in this area, however, due to these circular dependencies.

The sync blocks work because XceiverServerRatis ultimately tries to get the 
OzoneContainer instance via the DatanodeStateMachine instance on another thread 
while both of those objects are still being constructed. It is the constructor 
of OzoneContainer which starts the XceiverServerRatis, so synchronizing around 
the construction of OzoneContainer and its accessor in DatanodeStateMachine 
forces the XceiverServer to wait until construction is finished before it can 
get what it needs.
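A minimal sketch of that pattern (class and method names here are simplified 
stand-ins, not the real DatanodeStateMachine/OzoneContainer signatures): by 
putting construction and the accessor under the same monitor, a server thread 
that calls the accessor concurrently must block until construction has fully 
completed, so it can never observe a half-built object.

```java
// Illustrative stand-in for OzoneContainer; in the real code its
// constructor starts XceiverServerRatis, which may call back into
// the state machine's accessor from another thread.
class ContainerStub {
}

public class StateMachineSketch {
    private ContainerStub container;

    // Construction runs under the object's monitor. Any concurrent
    // caller of getContainer() blocks here until this returns.
    public synchronized void constructContainer() {
        container = new ContainerStub();
    }

    // Accessor synchronized on the same monitor: it cannot run while
    // construction is in progress, so it never returns a partially
    // initialized (or null-where-unexpected) container.
    public synchronized ContainerStub getContainer() {
        return container;
    }
}
```

Without the shared lock, the accessor could run between the start of 
construction and the field assignment and hand back null, which is exactly the 
NPE path seen in the stack trace below.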


> Datanode sometimes fails to start with NPE when starting Ratis xceiver server
> -----------------------------------------------------------------------------
>
>                 Key: HDDS-3116
>                 URL: https://issues.apache.org/jira/browse/HDDS-3116
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 0.5.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Blocker
>              Labels: pull-request-available
>         Attachments: full_logs.txt
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> While working on a network Topology test (HDDS-3084) which does the following:
> 1. Start a cluster with 6 DNs and 2 racks.
> 2. Create a volume, bucket and a single key.
> 3. Stop one rack of hosts using "docker-compose down"
> 4. Read the data from the single key
> 5. Start the 3 down hosts
> 6. Stop the other 3 hosts
> 7. Attempt to read the key again.
> At step 5 I sometimes see this stack trace in one of the DNs and it fails to 
> fully come up:
> {code}
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ozoneimpl.OzoneContainer: Attempting to start container services.
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ozoneimpl.OzoneContainer: Background container scanner has been disabled.
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ratis.XceiverServerRatis: Starting XceiverServerRatis 
> 8c1178dd-c44d-49d1-b899-cc3e40ae8f23 at port 9858
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] WARN 
> statemachine.EndpointStateMachine: Unable to communicate to SCM server at 
> scm:9861 for past 15000 seconds.
> java.io.IOException: java.lang.NullPointerException
>       at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
>       at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
>       at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70)
>       at 
> org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284)
>       at 
> org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296)
>       at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:418)
>       at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:232)
>       at 
> org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:113)
>       at 
> org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42)
>       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>       at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>       at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.NullPointerException
>       at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.sendPipelineReport(XceiverServerRatis.java:757)
>       at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.notifyGroupAdd(XceiverServerRatis.java:739)
>       at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.initialize(ContainerStateMachine.java:218)
>       at 
> org.apache.ratis.server.impl.ServerState.initStatemachine(ServerState.java:160)
>       at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:112)
>       at 
> org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:112)
>       at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
>       at 
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
>       ... 3 more
> {code}
> The DN does not recover from this automatically, although I confirmed that a 
> full cluster restart fixed it (docker-compose stop; docker-compose start). I 
> will try to confirm if a restart of the stuck DN would fix it or not too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
