sodonnel opened a new pull request #630: HDDS-3116. Datanode sometimes fails to 
start with NPE when starting Ratis xceiver server
URL: https://github.com/apache/hadoop-ozone/pull/630
 
 
   ## What changes were proposed in this pull request?
   
   While working on a network topology test (HDDS-3084) which stops and starts some DNs, I noticed that sometimes the DN fails to fully start up, giving a stack trace like the following:
   
   ```
   2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
ozoneimpl.OzoneContainer: Attempting to start container services.
   2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
ozoneimpl.OzoneContainer: Background container scanner has been disabled.
   2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
ratis.XceiverServerRatis: Starting XceiverServerRatis 
8c1178dd-c44d-49d1-b899-cc3e40ae8f23 at port 9858
   2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] WARN 
statemachine.EndpointStateMachine: Unable to communicate to SCM server at 
scm:9861 for past 15000 seconds.
   java.io.IOException: java.lang.NullPointerException
        at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
        at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
        at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70)
        at 
org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284)
        at 
org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296)
        at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:418)
        at 
org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:232)
        at 
org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:113)
        at 
org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
   Caused by: java.lang.NullPointerException
        at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.sendPipelineReport(XceiverServerRatis.java:757)
        at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.notifyGroupAdd(XceiverServerRatis.java:739)
        at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.initialize(ContainerStateMachine.java:218)
        at 
org.apache.ratis.server.impl.ServerState.initStatemachine(ServerState.java:160)
        at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:112)
        at 
org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:112)
        at 
org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
        at 
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
        ... 3 more
   ```
   
   The DN does not recover from this automatically. Restarting it may fix the problem, since the failure is intermittent.
   
   After some debugging, I traced the issue to the constructor of the DatanodeStateMachine class.
   
   The problem manifests in this line in XceiverServerRatis:
   
   ```java
   context.addReport(context.getParent().getContainer().getPipelineReport());
   ```
   
   Adding some debug logging here, I confirmed the NPE occurs because getContainer() returns null.
   
   Here, getContainer() is called on an instance of DatanodeStateMachine, and in that class the OzoneContainer variable is set in the constructor and never changed, except on shutdown.
   
   Checking that constructor, I can see the following:
   
   ```java
  public DatanodeStateMachine(DatanodeDetails datanodeDetails,
      Configuration conf, CertificateClient certClient,
      HddsDatanodeStopService hddsDatanodeStopService) throws IOException {
    OzoneConfiguration ozoneConf = new OzoneConfiguration(conf);
    DatanodeConfiguration dnConf =
        ozoneConf.getObject(DatanodeConfiguration.class);

    this.hddsDatanodeStopService = hddsDatanodeStopService;
    this.conf = conf;
    this.datanodeDetails = datanodeDetails;
    executorService = HadoopExecutors.newCachedThreadPool(
        new ThreadFactoryBuilder().setDaemon(true)
            .setNameFormat("Datanode State Machine Thread - %d").build());
    connectionManager = new SCMConnectionManager(conf);
    context = new StateContext(this.conf, DatanodeStates.getInitState(), this);
    container = new OzoneContainer(this.datanodeDetails,
        ozoneConf, context, certClient);
    ...
   ```
   
   Note the second-to-last line: the current object, *which has not finished constructing*, is passed to create a new StateContext. This context is then passed to the new OzoneContainer call on the last line.
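   
   To make the hazard concrete, here is a minimal, self-contained sketch of the same pattern (the class names are hypothetical stand-ins, not the real Ozone types): a reference to a partially constructed object is handed out, and a thread that reaches back through it can observe the not-yet-assigned field.
   
   ```java
// Hypothetical stand-ins for DatanodeStateMachine / StateContext /
// OzoneContainer, illustrating the "this escape" race only.
public class ThisEscapeDemo {

  static class Context {
    private final Parent parent;

    Context(Parent parent) {
      this.parent = parent;
    }

    Parent getParent() {
      return parent;
    }
  }

  static class Container {
    Container(Context context) {
      // Simulates a callback on another thread that reads back through the
      // context before Parent's constructor has assigned its field.
      new Thread(() -> {
        if (context.getParent().getContainer() == null) {
          System.out.println("container is still null -> NPE territory");
        }
      }).start();
    }
  }

  static class Parent {
    private Container container;

    Parent() {
      Context context = new Context(this); // `this` escapes before init done
      container = new Container(context);  // field assigned only afterwards
    }

    Container getContainer() {
      return container;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    new Parent();
    Thread.sleep(100); // give the background thread time to report
  }
}
   ```
   
   In the real code the read happens on a Ratis callback thread rather than a thread started directly in the constructor, and without synchronization there is neither an ordering nor a visibility guarantee for the `container` field.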
   
   Following the path further, the constructor of OzoneContainer makes the following call and passes the context into it, which is where the failure later surfaces:
   
   ```java
    this.writeChannel = XceiverServerRatis.newXceiverServerRatis(
        datanodeDetails, config, hddsDispatcher, controller, certClient,
        context);
   ```
   
   Attempting to avoid a major refactor here, I feel the easiest way to fix this is to synchronize the initialization of the `container` variable and also synchronize its getter. That way, this race condition cannot occur.
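   
   A minimal sketch of what I have in mind, on a simplified stand-in class (illustrative names, not the exact patch):
   
   ```java
// Simplified stand-in for DatanodeStateMachine: both the write and the read
// of `container` take the same lock, so a reader cannot interleave with the
// assignment, and the lock provides the happens-before edge needed for
// cross-thread visibility of the field.
public class DatanodeStateMachineSketch {

  private OzoneContainerStub container;

  public DatanodeStateMachineSketch() {
    // ... other constructor work ...
    initContainer();
  }

  private synchronized void initContainer() {
    container = new OzoneContainerStub();
  }

  public synchronized OzoneContainerStub getContainer() {
    return container;
  }

  /** Stand-in for the real OzoneContainer. */
  static class OzoneContainerStub {
  }
}
   ```
   
   Because the reader and the writer share one lock, a caller of getContainer() cannot observe a half-published field, and the synchronization also guarantees that the write made by the constructing thread is visible to other threads.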
   
   I am also open to alternative suggestions on a good way to fix this.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-3116
   
   ## How was this patch tested?
   
   No additional tests were added, but I manually ran a test case that reproduced the original issue, and with this change it no longer occurs.
