[jira] [Updated] (HDDS-5033) SCM may not be able to know full port list of Datanode after Datanode is started.

ASF GitHub Bot (Jira) Mon, 29 Mar 2021 01:39:06 -0700


     [ https://issues.apache.org/jira/browse/HDDS-5033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


ASF GitHub Bot updated HDDS-5033:
---------------------------------
    Labels: pull-request-available  (was: )

> SCM may not be able to know full port list of Datanode after Datanode is 
> started.
> ---------------------------------------------------------------------------------
>
>                 Key: HDDS-5033
>                 URL: https://issues.apache.org/jira/browse/HDDS-5033
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: SCM HA
>    Affects Versions: 1.2.0
>            Reporter: Glen Geng
>            Assignee: Glen Geng
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: 企业微信截图_097abd79-0ea4-487b-9b07-6bc2330385ef.png, 
> 企业微信截图_c0bd5dde-98ee-4350-914d-2e0069ea8602.png, 截屏2021-03-26 上午11.15.14.png
>
>
> Please check the attachments.
> After a DN restart, SCM may not know the full port list of that DN.
> This cannot be resolved without restarting SCM. As a consequence, the 
> Datanode cannot participate in any pipeline, and the DN keeps throwing NPEs.
> {code:java}
> 2021-03-25 15:04:16,322 [Command processor thread] ERROR 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: 
> Critical Error : Command processor thread encountered an error. Thread: 
> Thread[Command processor thread,5,main]
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.hdds.ratis.RatisHelper.toRaftPeerAddress(RatisHelper.java:99)
>         at 
> org.apache.hadoop.hdds.ratis.RatisHelper.raftPeerBuilderFor(RatisHelper.java:119)
>         at 
> org.apache.hadoop.hdds.ratis.RatisHelper.toRaftPeer(RatisHelper.java:111)
>         at 
> org.apache.hadoop.hdds.ratis.RatisHelper.newRaftGroup(RatisHelper.java:149)
>         at 
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CreatePipelineCommandHandler.handle(CreatePipelineCommandHandler.java:91)
>         at 
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CommandDispatcher.handle(CommandDispatcher.java:99)
>         at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$initCommandHandlerThread$2(DatanodeStateMachine.java:506)
>         at java.lang.Thread.run(Thread.java:748)
> 2021-03-25 15:04:16,323 [Command processor thread] ERROR 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: 
> Critical Error : Command processor thread encountered an error. Thread: 
> Thread[Command processor thread,5,main]
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.hdds.ratis.RatisHelper.toRaftPeerAddress(RatisHelper.java:99)
>         at 
> org.apache.hadoop.hdds.ratis.RatisHelper.raftPeerBuilderFor(RatisHelper.java:119)
>         at 
> org.apache.hadoop.hdds.ratis.RatisHelper.toRaftPeer(RatisHelper.java:111)
>         at 
> org.apache.hadoop.hdds.ratis.RatisHelper.newRaftGroup(RatisHelper.java:149)
>         at 
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CreatePipelineCommandHandler.handle(CreatePipelineCommandHandler.java:91)
>         at 
> org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CommandDispatcher.handle(CommandDispatcher.java:99)
>         at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$initCommandHandlerThread$2(DatanodeStateMachine.java:506)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> After restarting SCM, the issue is gone.
> The likely cause: SCMNodeManager records the DatanodeDetails only once, 
> during registration.
> The DN, however, does not record the admin, server, and client ports in 
> DatanodeDetails until its Ratis server is up.
> So there is a race: if the register request is sent before the 
> Ratis server is up, SCM will not know the full port list of that DN.
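A minimal sketch of the record-once behavior described above, assuming a registry keyed by node ID. `RecordOnceRegistry`, its methods, and the port names are hypothetical and are not the SCMNodeManager API; the sketch only illustrates why an early, partial registration can stay visible until the registry is reset.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry, for illustration only; not the Ozone SCMNodeManager.
class RecordOnceRegistry {
  private final Map<String, Set<String>> portsByNode = new ConcurrentHashMap<>();

  // Stores the port names on the first registration only; later calls
  // for the same node are ignored, so a partial list is never corrected.
  void registerOnce(String nodeId, Set<String> portNames) {
    portsByNode.putIfAbsent(nodeId, new TreeSet<>(portNames));
  }

  // Alternative shown for contrast: a repeated registration replaces
  // the stored snapshot.
  void registerOrRefresh(String nodeId, Set<String> portNames) {
    portsByNode.put(nodeId, new TreeSet<>(portNames));
  }

  // Returns the port names currently known for the node (empty if unknown).
  Set<String> knownPorts(String nodeId) {
    return portsByNode.getOrDefault(nodeId, Collections.emptySet());
  }
}
```

With `registerOnce`, a second registration that carries the complete list has no effect, which matches the symptom that only an SCM restart clears the stale entry. `registerOrRefresh` is one possible alternative, not necessarily the fix chosen for this issue.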
>  
>  *UPDATE*
> {code:java}
> public void start(String clusterId) throws IOException {
>   if (!isStarted.compareAndSet(false, true)) {
>     LOG.info("Ignore. OzoneContainer already started.");
>     return;
>   }
>   LOG.info("Attempting to start container services.");
>   startContainerScrub();
>   replicationServer.start();
>   datanodeDetails.setPort(Name.REPLICATION, replicationServer.getPort());
>   writeChannel.start();
>   readChannel.start();
>   hddsDispatcher.init();
>   hddsDispatcher.setClusterId(clusterId);
>   blockDeletingService.start();
> }
> {code}
> We are running an SCM HA test, which means start() is called multiple times, 
> and only the first SCM connection succeeds in the CAS. The second SCM 
> connection does not wait for writeChannel.start() and thus gets a partial 
> port list.
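One way to avoid the early return, sketched under assumptions: the caller that loses the CAS blocks until the winner has finished starting. `StartOnceService`, the latch, and the port value are hypothetical stand-ins and are not taken from OzoneContainer; the actual fix may look different.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a service with an idempotent start();
// not the Ozone class.
class StartOnceService {
  private final AtomicBoolean startRequested = new AtomicBoolean(false);
  private final CountDownLatch startFinished = new CountDownLatch(1);
  private volatile int port = -1; // placeholder for a port published during startup

  void start() throws InterruptedException {
    if (!startRequested.compareAndSet(false, true)) {
      // Lost the CAS: wait for the winner to finish instead of returning early.
      startFinished.await();
      return;
    }
    try {
      Thread.sleep(50); // simulates slow channel startup
      port = 12345;     // arbitrary value for illustration
    } finally {
      // Released even on failure, so waiters never block forever;
      // in that case they observe port == -1.
      startFinished.countDown();
    }
  }

  int getPort() {
    return port;
  }
}
```

Any thread that returns from `start()` then sees the port written by the winning thread, because `countDown()` happens-before the matching `await()` returns.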
>  
> *UPDATE again*
> It is a race between connections to multiple SCMs on the DN side. We also 
> need to add a lock to DatanodeDetails.
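A minimal sketch of what guarding the port list with a lock could look like. `GuardedPorts` and its `Name` enum are hypothetical and do not reproduce the DatanodeDetails API; the point is only that writers and readers share one monitor and that readers get a consistent copy.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical holder for named ports; not the Ozone DatanodeDetails class.
class GuardedPorts {
  // Placeholder port names for illustration.
  enum Name { STANDALONE, RATIS, REPLICATION }

  private final EnumMap<Name, Integer> ports = new EnumMap<>(Name.class);

  // Writers take the same monitor as readers, so an update is never
  // interleaved with a copy in progress.
  synchronized void setPort(Name name, int port) {
    ports.put(name, port);
  }

  // Returns an immutable copy taken under the lock; later setPort calls
  // do not change a snapshot that was already handed out.
  synchronized Map<Name, Integer> snapshot() {
    return Collections.unmodifiableMap(new EnumMap<>(ports));
  }
}
```

A lock alone keeps each read consistent, but a reader can still copy the list before all ports have been set, so it complements waiting for startup to finish rather than replacing it.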



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
