sodonnel opened a new pull request #630: HDDS-3116. Datanode sometimes fails to start with NPE when starting Ratis xceiver server
URL: https://github.com/apache/hadoop-ozone/pull/630

## What changes were proposed in this pull request?

While working on a network topology test (HDDS-3084) which stops and starts some DNs, I noticed that sometimes the DN fails to fully start up, giving a stack trace like the following:

```
2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO ozoneimpl.OzoneContainer: Attempting to start container services.
2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO ozoneimpl.OzoneContainer: Background container scanner has been disabled.
2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO ratis.XceiverServerRatis: Starting XceiverServerRatis 8c1178dd-c44d-49d1-b899-cc3e40ae8f23 at port 9858
2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] WARN statemachine.EndpointStateMachine: Unable to communicate to SCM server at scm:9861 for past 15000 seconds.
java.io.IOException: java.lang.NullPointerException
    at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
    at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
    at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70)
    at org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284)
    at org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296)
    at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:418)
    at org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:232)
    at org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:113)
    at org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.sendPipelineReport(XceiverServerRatis.java:757)
    at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.notifyGroupAdd(XceiverServerRatis.java:739)
    at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.initialize(ContainerStateMachine.java:218)
    at org.apache.ratis.server.impl.ServerState.initStatemachine(ServerState.java:160)
    at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:112)
    at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:112)
    at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
    ... 3 more
```

The DN does not recover from this automatically. Restarting it may fix the problem, as it occurs intermittently. After some debugging, I traced the issue to the constructor of the DatanodeStateMachine class.
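Before walking through the actual code, here is a minimal, self-contained illustration of the general pattern behind the failure. The class names here are hypothetical, not Ozone classes: a constructor leaks `this` to a collaborator before one of its fields is assigned, so a call arriving from another thread in that window observes the field as null.

```java
class Resource {
  void doWork() { }
}

class Collaborator {
  private final Owner owner;

  Collaborator(Owner owner) {
    this.owner = owner; // a partially constructed Owner escapes here
  }

  void useOwner() {
    // NPE if this runs before Owner's constructor assigns resource.
    owner.getResource().doWork();
  }
}

class Owner {
  private Resource resource;

  Owner() {
    Collaborator collaborator = new Collaborator(this);
    // If collaborator.useOwner() runs on another thread at this point,
    // getResource() still returns null.
    resource = new Resource();
  }

  Resource getResource() {
    return resource;
  }
}
```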
In Ozone, the problem manifests in this line in XceiverServerRatis:

```
context.addReport(context.getParent().getContainer().getPipelineReport());
```

Adding some debug logging here, I confirmed the NPE occurs because getContainer() returns null. getParent() returns an instance of DatanodeStateMachine, and in that class the OzoneContainer variable is set in the constructor and never changed, except on shutdown. Checking that constructor, I can see the following:

```
public DatanodeStateMachine(DatanodeDetails datanodeDetails,
    Configuration conf, CertificateClient certClient,
    HddsDatanodeStopService hddsDatanodeStopService) throws IOException {
  OzoneConfiguration ozoneConf = new OzoneConfiguration(conf);
  DatanodeConfiguration dnConf =
      ozoneConf.getObject(DatanodeConfiguration.class);
  this.hddsDatanodeStopService = hddsDatanodeStopService;
  this.conf = conf;
  this.datanodeDetails = datanodeDetails;
  executorService = HadoopExecutors.newCachedThreadPool(
      new ThreadFactoryBuilder().setDaemon(true)
          .setNameFormat("Datanode State Machine Thread - %d").build());
  connectionManager = new SCMConnectionManager(conf);
  context = new StateContext(this.conf, DatanodeStates.getInitState(), this);
  container = new OzoneContainer(this.datanodeDetails, ozoneConf,
      context, certClient);
  ...
```

Note the second-to-last line: the current object, *which has not finished constructing*, is passed to create a new StateContext. This context is then passed to the new OzoneContainer call on the last line. Following the path further, the constructor for OzoneContainer passes the context on again in the following call, which is where it later fails:

```
this.writeChannel = XceiverServerRatis.newXceiverServerRatis(
    datanodeDetails, config, hddsDispatcher, controller,
    certClient, context);
```

To avoid a major refactor here, I feel the easiest way to fix this is to synchronize the initialization of the `container` variable and also synchronize its getter, so that this race condition cannot occur. I am also open to alternative suggestions on a good way to fix this.
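To make that concrete, here is a minimal sketch of the shape such a fix could take. This is an illustration under my own naming assumptions (the `constructionLock` field, the sketch class, and the use of `Object` as a stand-in for OzoneContainer are all hypothetical), not the final patch:

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: a stand-in for DatanodeStateMachine showing the locking idea.
class StateMachineSketch {

  // Hypothetical lock guarding construction of the container field.
  private final ReadWriteLock constructionLock =
      new ReentrantReadWriteLock();

  private Object container; // stands in for OzoneContainer

  StateMachineSketch() {
    constructionLock.writeLock().lock();
    try {
      // The expensive construction happens while the write lock is held,
      // so a concurrent getContainer() call blocks here rather than
      // observing a null field.
      container = new Object();
    } finally {
      constructionLock.writeLock().unlock();
    }
  }

  Object getContainer() {
    constructionLock.readLock().lock();
    try {
      return container;
    } finally {
      constructionLock.readLock().unlock();
    }
  }
}
```

A read-write lock keeps the getter cheap once construction is done; a plain synchronized block on a shared monitor would give the same visibility and ordering guarantees.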
## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3116

## How was this patch tested?

No additional tests were added, but I manually ran a test case which reproduced the original issue, and the issue no longer occurs.
