[jira] [Commented] (HDDS-3116) Datanode sometimes fails to start with NPE when starting Ratis xceiver server

Stephen O'Donnell (Jira) Tue, 03 Mar 2020 06:52:33 -0800


    [ 
https://issues.apache.org/jira/browse/HDDS-3116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050280#comment-17050280
 ]


Stephen O'Donnell commented on HDDS-3116:
-----------------------------------------

This problem is actually quite easy to reproduce, simply by starting and 
stopping a datanode docker container over and over until the problem happens. A 
futher restart makes the problem go away. That led me to believe this is a race 
condition in the initialisation code rather than some bad state stored on the 
DN.

Adding a bit of extra debug, I believe the problem is created in the 
constructor of the DatanodeStateMachine class.

The problem manifests in this line in XceiverServerRatis:

{code}
context.addReport(context.getParent().getContainer().getPipelineReport());
{code}

Adding some debug here, I proved the NPE is happening as getContainer() returns 
null.

This method is calling an instance of DatanodeStateMachine and in that class 
you can see the OzoneContainer variable is set in the constructor and never 
changed, except on shutdown.

Checking that constructor, I can see the following:

{code}
  public DatanodeStateMachine(DatanodeDetails datanodeDetails,
      Configuration conf, CertificateClient certClient,
      HddsDatanodeStopService hddsDatanodeStopService) throws IOException {
    OzoneConfiguration ozoneConf = new OzoneConfiguration(conf);
    DatanodeConfiguration dnConf =
        ozoneConf.getObject(DatanodeConfiguration.class);

    this.hddsDatanodeStopService = hddsDatanodeStopService;
    this.conf = conf;
    this.datanodeDetails = datanodeDetails;
    executorService = HadoopExecutors.newCachedThreadPool(
                new ThreadFactoryBuilder().setDaemon(true)
            .setNameFormat("Datanode State Machine Thread - %d").build());
    connectionManager = new SCMConnectionManager(conf);
    context = new StateContext(this.conf, DatanodeStates.getInitState(), this);
    container = new OzoneContainer(this.datanodeDetails,
                                   ozoneConf, context, certClient);
    ...                    
{code}

Note the second last line - the current object **which has not finished 
constructing** is passed to create a new StateContext. This context is then 
passed to the new OzoneContainer call on the last line.

Following the path further, in the constructor for OzoneContainer we can see it 
calls the following and passes the context to it:

{code}
    this.writeChannel = XceiverServerRatis.newXceiverServerRatis(
        datanodeDetails, config, hddsDispatcher, controller, certClient,
        context);
{code}

Which creates the XceiverServerRatis, which has its own thread pool. If this 
Ratis server calls sendPipelineReport before the OzoneContainerConstructor and 
DatanodeStateMachine constructors complete, this bug occurs and the DN will 
never startup.

The question is how to fix this as ultimately the RatisServer startup code is 
accessing two other objects that have not completed construction. We might be 
able to put some synchronization in to help with a quick fix.


> Datanode sometimes fails to start with NPE when starting Ratis xceiver server
> -----------------------------------------------------------------------------
>
>                 Key: HDDS-3116
>                 URL: https://issues.apache.org/jira/browse/HDDS-3116
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 0.5.0
>            Reporter: Stephen O'Donnell
>            Assignee: Nanda kumar
>            Priority: Major
>         Attachments: full_logs.txt
>
>
> While working on a network Topology test (HDDS-3084) which does the following:
> 1. Start a cluster with 6 DNs and 2 racks.
> 2. Create a volume, bucket and a single key.
> 3. Stop one rack of hosts using "docker-compose down"
> 4. Read the data from the single key
> 5. Start the 3 down hosts
> 6. Stop the other 3 hosts
> 7. Attempt to read the key again.
> At step 5 I sometimes see this stack trace in one of the DNs and it fails to 
> full come up:
> {code}
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ozoneimpl.OzoneContainer: Attempting to start container services.
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ozoneimpl.OzoneContainer: Background container scanner has been disabled.
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ratis.XceiverServerRatis: Starting XceiverServerRatis 
> 8c1178dd-c44d-49d1-b899-cc3e40ae8f23 at port 9858
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] WARN 
> statemachine.EndpointStateMachine: Unable to communicate to SCM server at 
> scm:9861 for past 15000 seconds.
> java.io.IOException: java.lang.NullPointerException
>       at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
>       at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
>       at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70)
>       at 
> org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284)
>       at 
> org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296)
>       at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:418)
>       at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:232)
>       at 
> org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:113)
>       at 
> org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42)
>       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>       at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>       at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.NullPointerException
>       at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.sendPipelineReport(XceiverServerRatis.java:757)
>       at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.notifyGroupAdd(XceiverServerRatis.java:739)
>       at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.initialize(ContainerStateMachine.java:218)
>       at 
> org.apache.ratis.server.impl.ServerState.initStatemachine(ServerState.java:160)
>       at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:112)
>       at 
> org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:112)
>       at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
>       at 
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
>       ... 3 more
> {code}
> The DN does not recover from this automatically, although I confirmed that a 
> full cluster restart fixed it (docker-compose stop; docker-compose start). I 
> will try to confirm if a restart of the stuck DN would fix it or not too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HDDS-3116) Datanode sometimes fails to start with NPE when starting Ratis xceiver server

Reply via email to