Nanda kumar created HDDS-830:
--------------------------------

             Summary: Datanode should not start XceiverServerRatis before getting version information from SCM
                 Key: HDDS-830
                 URL: https://issues.apache.org/jira/browse/HDDS-830
             Project: Hadoop Distributed Data Store
          Issue Type: Improvement
          Components: Ozone Datanode
    Affects Versions: 0.3.0
            Reporter: Nanda kumar

If a datanode restarts quickly, before SCM detects the restart, it rejoins the Ratis ring (the existing pipeline). Because SCM did not detect the restart, the pipeline is not closed. There is a time gap between the datanode starting and the datanode receiving the version information from SCM. During this gap, the SCM ID in the datanode is not set (it is null). If a client tries to use the pipeline during that time, the container state machine throws {{java.lang.NullPointerException: scmId cannot be null}}. This causes {{RaftLogWorker}} to terminate, which crashes the datanode.

{code}
2018-11-12 19:45:31,811 ERROR storage.RaftLogWorker (ExitUtils.java:terminate(86)) - Terminating with exit status 1: 407fd181-2ff7-4651-9a47-a0927ede4c51-RaftLogWorker failed.
java.io.IOException: java.lang.NullPointerException: scmId cannot be null
    at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
    at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
    at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:83)
    at org.apache.ratis.server.storage.RaftLogWorker$StateMachineDataPolicy.getFromFuture(RaftLogWorker.java:76)
    at org.apache.ratis.server.storage.RaftLogWorker$WriteLog.execute(RaftLogWorker.java:344)
    at org.apache.ratis.server.storage.RaftLogWorker.run(RaftLogWorker.java:216)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException: scmId cannot be null
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.create(KeyValueContainer.java:106)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleCreateContainer(KeyValueHandler.java:242)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:165)
    at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.createContainer(HddsDispatcher.java:206)
    at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:124)
    at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(ContainerStateMachine.java:274)
    at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.runCommand(ContainerStateMachine.java:280)
    at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$handleWriteChunk$1(ContainerStateMachine.java:301)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
{code}
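
The ordering proposed in the summary could be sketched roughly as follows. This is only an illustration of the idea, not the actual Ozone datanode code: the class {{GatedRatisServer}} and its methods {{onVersionResponse}}, {{isStarted}}, and {{createContainer}} are hypothetical names invented for this example. The sketch assumes the server is started only once the SCM ID is known, so a request arriving earlier is rejected with a clear error instead of reaching a null {{scmId}}.

```java
/**
 * Hypothetical sketch: a server that refuses to serve requests until the
 * SCM ID has been received. None of these names exist in Ozone.
 */
public class GatedRatisServer {
    private String scmId;      // stays null until the version response arrives
    private boolean started;   // flips to true only after scmId is set

    /** Hypothetical hook, called when the version information arrives from SCM. */
    public synchronized void onVersionResponse(String id) {
        if (id == null) {
            throw new IllegalArgumentException("scmId cannot be null");
        }
        scmId = id;
        started = true;        // only now would the Ratis server be started
    }

    public synchronized boolean isStarted() {
        return started;
    }

    /** Stand-in for container creation; returns a label built from scmId. */
    public synchronized String createContainer(long containerId) {
        if (!started) {
            // Reject early instead of hitting a NullPointerException later.
            throw new IllegalStateException(
                "server not started: SCM version information not yet received");
        }
        return scmId + "/" + containerId;
    }
}
```

With this ordering, a client request that arrives in the gap fails with an {{IllegalStateException}} that the caller can handle or retry, rather than terminating the log worker.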