[
https://issues.apache.org/jira/browse/HDDS-5033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Glen Geng updated HDDS-5033:
----------------------------
Summary: SCM may not be able to know full port list of Datanode after
Datanode is started. (was: SCM may not be able to know full port list of
Datanode after Datanode is restarted.)
> SCM may not be able to know full port list of Datanode after Datanode is
> started.
> ---------------------------------------------------------------------------------
>
> Key: HDDS-5033
> URL: https://issues.apache.org/jira/browse/HDDS-5033
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: SCM HA
> Affects Versions: 1.2.0
> Reporter: Glen Geng
> Assignee: Glen Geng
> Priority: Major
> Attachments: 企业微信截图_097abd79-0ea4-487b-9b07-6bc2330385ef.png,
> 企业微信截图_c0bd5dde-98ee-4350-914d-2e0069ea8602.png, 截屏2021-03-26 上午11.15.14.png
>
>
> Please check the attachments.
> After a DN restarts, the SCM may not learn the full port list of that DN.
> The problem persists until the SCM is restarted. As a consequence, the
> Datanode cannot participate in any pipeline, and the DN throws NPEs
> continually.
> {code:java}
> 2021-03-25 15:04:16,322 [Command processor thread] ERROR org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: Critical Error : Command processor thread encountered an error. Thread: Thread[Command processor thread,5,main]
> java.lang.NullPointerException
>   at org.apache.hadoop.hdds.ratis.RatisHelper.toRaftPeerAddress(RatisHelper.java:99)
>   at org.apache.hadoop.hdds.ratis.RatisHelper.raftPeerBuilderFor(RatisHelper.java:119)
>   at org.apache.hadoop.hdds.ratis.RatisHelper.toRaftPeer(RatisHelper.java:111)
>   at org.apache.hadoop.hdds.ratis.RatisHelper.newRaftGroup(RatisHelper.java:149)
>   at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CreatePipelineCommandHandler.handle(CreatePipelineCommandHandler.java:91)
>   at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CommandDispatcher.handle(CommandDispatcher.java:99)
>   at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$initCommandHandlerThread$2(DatanodeStateMachine.java:506)
>   at java.lang.Thread.run(Thread.java:748)
> 2021-03-25 15:04:16,323 [Command processor thread] ERROR org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: Critical Error : Command processor thread encountered an error. Thread: Thread[Command processor thread,5,main]
> java.lang.NullPointerException
>   at org.apache.hadoop.hdds.ratis.RatisHelper.toRaftPeerAddress(RatisHelper.java:99)
>   at org.apache.hadoop.hdds.ratis.RatisHelper.raftPeerBuilderFor(RatisHelper.java:119)
>   at org.apache.hadoop.hdds.ratis.RatisHelper.toRaftPeer(RatisHelper.java:111)
>   at org.apache.hadoop.hdds.ratis.RatisHelper.newRaftGroup(RatisHelper.java:149)
>   at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CreatePipelineCommandHandler.handle(CreatePipelineCommandHandler.java:91)
>   at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CommandDispatcher.handle(CommandDispatcher.java:99)
>   at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$initCommandHandlerThread$2(DatanodeStateMachine.java:506)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>
> After the SCM is restarted, the issue goes away.
> The likely cause: SCMNodeManager records the DatanodeDetails only once,
> during registration.
> The DN, however, does not record the admin, server, and client ports in
> DatanodeDetails until its Ratis server is up.
> This creates a race: if the register request is sent before the Ratis
> server is up, the SCM won't know the full port list of that DN.
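
The register-once behaviour described above can be illustrated with a minimal sketch. Everything in it is hypothetical: `RegistrationSnapshotDemo` is a stand-in for a registry that keeps only the first registration, not Ozone's SCMNodeManager, and the port names and numbers are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a register-once node registry; not Ozone's SCMNodeManager. */
class RegistrationSnapshotDemo {
    // Ports recorded per datanode at registration time.
    private final Map<String, Map<String, Integer>> registered = new HashMap<>();

    /** Stores a copy of the ports on first registration; later calls are ignored. */
    void register(String nodeId, Map<String, Integer> ports) {
        registered.putIfAbsent(nodeId, new HashMap<>(ports));
    }

    /** Returns the recorded port, or null if the registry never saw it. */
    Integer lookupPort(String nodeId, String name) {
        Map<String, Integer> ports = registered.get(nodeId);
        return ports == null ? null : ports.get(name);
    }

    /** Replays the ordering from the report: register first, Ratis port later. */
    static Integer registerBeforeRatisIsUp() {
        RegistrationSnapshotDemo scm = new RegistrationSnapshotDemo();
        Map<String, Integer> dnPorts = new HashMap<>();
        dnPorts.put("REPLICATION", 9886);   // placeholder port number
        scm.register("dn-1", dnPorts);      // register request goes out early
        dnPorts.put("RATIS", 9858);         // Ratis server comes up afterwards
        scm.register("dn-1", dnPorts);      // ignored: dn-1 is already registered
        return scm.lookupPort("dn-1", "RATIS");
    }

    public static void main(String[] args) {
        System.out.println("RATIS port seen by registry: " + registerBeforeRatisIsUp());
    }
}
```

In this model the lookup returns null until the registry is recreated, which is consistent with the NPE in the stack trace and with the observation that restarting the SCM clears the problem.
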
>
> *UPDATE*
> {code:java}
> public void start(String clusterId) throws IOException {
>   if (!isStarted.compareAndSet(false, true)) {
>     LOG.info("Ignore. OzoneContainer already started.");
>     return;
>   }
>   LOG.info("Attempting to start container services.");
>   startContainerScrub();
>   replicationServer.start();
>   datanodeDetails.setPort(Name.REPLICATION, replicationServer.getPort());
>   writeChannel.start();
>   readChannel.start();
>   hddsDispatcher.init();
>   hddsDispatcher.setClusterId(clusterId);
>   blockDeletingService.start();
> }
> {code}
> We are running an SCM HA test, which means start() is called multiple times,
> and only the first SCM connection succeeds in the CAS. The second SCM
> connection does not wait for writeChannel.start() to finish, and thus gets a
> partial port list.
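
The CAS behaviour above can be reproduced in a small model. This is a sketch under assumptions, not Ozone code: `CasGuardDemo`, the `slowStep` latch (standing in for a slow writeChannel.start()), and the `waitForWinner` flag are all hypothetical. It contrasts a loser of the CAS that returns at once with one possible remedy, where the loser blocks until the winner has published its ports.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical model of a CAS-guarded start(); not Ozone's OzoneContainer. */
class CasGuardDemo {
    private final AtomicBoolean isStarted = new AtomicBoolean(false);
    private final CountDownLatch startFinished = new CountDownLatch(1);
    private final Map<String, Integer> ports = new ConcurrentHashMap<>();

    void start(boolean waitForWinner, CountDownLatch slowStep) {
        try {
            if (!isStarted.compareAndSet(false, true)) {
                if (waitForWinner) {
                    startFinished.await(); // block until the winner has set all ports
                }
                return; // without waiting, this may return mid-initialization
            }
            slowStep.await();              // stands in for a slow writeChannel.start()
            ports.put("RATIS", 9858);      // placeholder port number
            startFinished.countDown();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Reports whether the second caller sees the RATIS port once its start() returns. */
    static boolean secondCallerSeesRatisPort(boolean waitForWinner) {
        CasGuardDemo dn = new CasGuardDemo();
        CountDownLatch slowStep = new CountDownLatch(1);
        AtomicBoolean sawPort = new AtomicBoolean(false);
        try {
            Thread first = new Thread(() -> dn.start(waitForWinner, slowStep));
            first.start();
            while (!dn.isStarted.get()) {
                Thread.yield();            // wait until the first caller has won the CAS
            }
            Thread second = new Thread(() -> {
                dn.start(waitForWinner, slowStep);
                sawPort.set(dn.ports.containsKey("RATIS"));
            });
            second.start();
            if (waitForWinner) {
                second.join(200);          // second caller is expected to stay blocked
            } else {
                second.join();             // second caller returns immediately
            }
            slowStep.countDown();          // let the first caller finish starting
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sawPort.get();
    }

    public static void main(String[] args) {
        System.out.println("no wait: " + secondCallerSeesRatisPort(false)); // false
        System.out.println("wait:    " + secondCallerSeesRatisPort(true));  // true
    }
}
```

Waiting on a latch is only one option; whether it fits the real start-up path depends on details not shown in this report.
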
>
> *UPDATE again*
> The race comes from the DN connecting to multiple SCMs. We also need to add
> a lock to DatanodeDetails.
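
One way to read "add a lock to DatanodeDetails" is to guard the port list so that concurrent writers and readers never observe it mid-update. The sketch below is an assumption about the shape of such a change, not the actual patch: `LockedPorts` and the port names are hypothetical, and a lock of this kind protects the list itself but does not by itself make a reader wait for ports that have not been set yet.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical lock-guarded port list; not Ozone's DatanodeDetails. */
class LockedPorts {
    private final Map<String, Integer> ports = new LinkedHashMap<>();

    /** Adds or replaces a port while holding the object's monitor. */
    synchronized void setPort(String name, int value) {
        ports.put(name, value);
    }

    /** Returns a copy, so callers never iterate over a list that is being modified. */
    synchronized Map<String, Integer> snapshot() {
        return new LinkedHashMap<>(ports);
    }

    /** Two threads (one per SCM connection, say) set the same ports concurrently. */
    static int portsAfterConcurrentSetup() {
        LockedPorts details = new LockedPorts();
        String[] names = {"REPLICATION", "RATIS_ADMIN", "RATIS_SERVER", "RATIS_CLIENT"};
        Thread[] workers = new Thread[2];
        for (int t = 0; t < workers.length; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < names.length; i++) {
                    details.setPort(names[i], 9000 + i); // placeholder port numbers
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return details.snapshot().size();
    }

    public static void main(String[] args) {
        System.out.println("ports recorded: " + portsAfterConcurrentSetup()); // 4
    }
}
```
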