Stephen O'Donnell created HDDS-4766:
---------------------------------------

             Summary: Recon resets the Operational State of datanodes to 
IN_SERVICE
                 Key: HDDS-4766
                 URL: https://issues.apache.org/jira/browse/HDDS-4766
             Project: Hadoop Distributed Data Store
          Issue Type: Bug
          Components: Ozone Recon
    Affects Versions: 1.1.0
            Reporter: Stephen O'Donnell


When a datanode is decommissioned or put into maintenance, its new state is 
persisted to the datanode.yaml file. When running on a cluster with Recon 
enabled, the datanode repeatedly receives conflicting commands, e.g.:

{code}
datanode_3  | 2021-01-29 16:26:20,009 [EndpointStateMachine task thread for 
scm/172.24.0.6:9861 - 0 ] INFO endpoint.HeartbeatEndpointTask: Received SCM set 
operational state command. State: DECOMMISSIONED Expiry: 0 id 3645344
datanode_3  | 2021-01-29 16:26:50,012 [EndpointStateMachine task thread for 
recon/172.24.0.3:9891 - 0 ] INFO commands.SetNodeOperationalStateCommand: 
Create a new command to set op state IN_SERVICE 0 id is 3675347
{code}

This happens because Recon delegates processing of the DN heartbeats received 
by ReconNodeManager to an instance of SCMNodeManager running inside Recon. 
SCMNodeManager checks that the state reported by the datanode matches the 
in-memory state it holds and, if they do not match, issues a command to the DN 
to update its state.

In this case, Recon's in-memory state for the DN is still IN_SERVICE, so Recon 
always tries to set the DN state back to IN_SERVICE.
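
The check described above can be modelled roughly as follows. This is only an 
illustrative sketch: OpState and NodeManagerModel are hypothetical stand-ins, 
not the real Ozone classes, and the state list is simplified. It shows two 
independent in-memory views (one playing SCM, one playing Recon) answering the 
same heartbeat differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified set of operational states (assumption for this sketch).
enum OpState { IN_SERVICE, DECOMMISSIONED, IN_MAINTENANCE }

// Hypothetical model of a node manager that reconciles the state reported in a
// heartbeat against its own in-memory view. Not the real SCMNodeManager.
class NodeManagerModel {
    // In-memory operational state per datanode, keyed by datanode ID.
    private final Map<String, OpState> inMemoryState = new HashMap<>();

    // A newly registered datanode starts out IN_SERVICE in this model.
    void register(String datanodeId) {
        inMemoryState.put(datanodeId, OpState.IN_SERVICE);
    }

    // An admin action (e.g. decommission) changes the in-memory state.
    void setOpState(String datanodeId, OpState state) {
        inMemoryState.put(datanodeId, state);
    }

    // Returns the state the datanode is commanded to adopt, or empty when the
    // reported state already matches the in-memory state.
    Optional<OpState> processHeartbeat(String datanodeId, OpState reported) {
        OpState expected = inMemoryState.get(datanodeId);
        if (expected == null || expected == reported) {
            return Optional.empty();
        }
        return Optional.of(expected);
    }
}
```

If the SCM-side instance has been told the node is DECOMMISSIONED and the 
Recon-side instance has not, a heartbeat reporting DECOMMISSIONED yields no 
command from the former and a "set IN_SERVICE" command from the latter, which 
matches the pattern in the log above.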

The fix is probably to update Recon's in-memory state before delegating 
the heartbeat to SCMNodeManager.
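
A minimal sketch of that ordering, assuming Recon should simply accept the 
operational state the datanode reports. All class and method names here 
(OperationalState, DelegateNodeManager, ReconSideNodeManager) are hypothetical 
and only model the idea; they are not the actual Ozone API or the eventual 
patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified set of operational states (assumption for this sketch).
enum OperationalState { IN_SERVICE, DECOMMISSIONED, IN_MAINTENANCE }

// Hypothetical stand-in for the delegate that compares reported and in-memory state.
class DelegateNodeManager {
    private final Map<String, OperationalState> states = new HashMap<>();

    void setState(String datanodeId, OperationalState state) {
        states.put(datanodeId, state);
    }

    // Returns a corrective state to send to the datanode, or empty if none is needed.
    Optional<OperationalState> processHeartbeat(String datanodeId,
                                                OperationalState reported) {
        OperationalState expected =
            states.getOrDefault(datanodeId, OperationalState.IN_SERVICE);
        return expected == reported ? Optional.empty() : Optional.of(expected);
    }
}

// Hypothetical stand-in for the Recon-side wrapper around the delegate.
class ReconSideNodeManager {
    private final DelegateNodeManager delegate = new DelegateNodeManager();

    Optional<OperationalState> processHeartbeat(String datanodeId,
                                                OperationalState reported) {
        // Proposed ordering: first sync the in-memory state to what was reported...
        delegate.setState(datanodeId, reported);
        // ...then delegate, so the comparison no longer finds a mismatch.
        return delegate.processHeartbeat(datanodeId, reported);
    }
}
```

With this ordering the model never produces a corrective command from the 
Recon side, whatever state the datanode reports.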



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
