Stephen O'Donnell created HDDS-4766:
---------------------------------------
Summary: Recon resets the Operational State of datanodes to IN_SERVICE
Key: HDDS-4766
URL: https://issues.apache.org/jira/browse/HDDS-4766
Project: Hadoop Distributed Data Store
Issue Type: Bug
Components: Ozone Recon
Affects Versions: 1.1.0
Reporter: Stephen O'Donnell
When a datanode is decommissioned or put into maintenance, its new state is
persisted into the datanode.yaml file. When running on a cluster with Recon
enabled, we can see conflicting commands being received repeatedly on the
Datanode, e.g.:
{code}
datanode_3 | 2021-01-29 16:26:20,009 [EndpointStateMachine task thread for
scm/172.24.0.6:9861 - 0 ] INFO endpoint.HeartbeatEndpointTask: Received SCM set
operational state command. State: DECOMMISSIONED Expiry: 0 id 3645344
datanode_3 | 2021-01-29 16:26:50,012 [EndpointStateMachine task thread for
recon/172.24.0.3:9891 - 0 ] INFO commands.SetNodeOperationalStateCommand:
Create a new command to set op state IN_SERVICE 0 id is 3675347
{code}
This is happening because Recon delegates the processing of DN heartbeats
received by ReconNodeManager to an instance of SCMNodeManager running inside
Recon. SCMNodeManager checks that the operational state reported by the datanode
matches the state it holds in memory, and if they don't match, it issues a
command to the DN to update its state.
In this case, the in-memory state inside Recon is still IN_SERVICE, so Recon
always tries to set the DN state back to IN_SERVICE.
The fix here is probably to update the Recon in-memory state before delegating
the heartbeat to SCMNodeManager.
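To make the mismatch and the proposed fix concrete, here is a minimal, self-contained sketch. It is not the Ozone code: NodeManagerModel and ReconModel are hypothetical stand-ins for SCMNodeManager and ReconNodeManager, OpState is a simplified stand-in for the real operational state enum, and the syncBeforeDelegating flag exists only to compare the two behaviours side by side.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class ReconHeartbeatSketch {

    // Simplified stand-in for the datanode operational states (assumption).
    enum OpState { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED, IN_MAINTENANCE }

    // Hypothetical stand-in for SCMNodeManager: compares the state reported
    // in a heartbeat with the state it holds in memory.
    static class NodeManagerModel {
        private final Map<String, OpState> inMemory = new HashMap<>();

        void register(String dn) {
            // Newly registered nodes start as IN_SERVICE in this model.
            inMemory.put(dn, OpState.IN_SERVICE);
        }

        void setOpState(String dn, OpState state) {
            inMemory.put(dn, state);
        }

        // Returns the state the DN is told to adopt, or empty if the
        // reported state already agrees with the in-memory state.
        Optional<OpState> processHeartbeat(String dn, OpState reported) {
            OpState expected = inMemory.get(dn);
            if (expected == null || expected == reported) {
                return Optional.empty();
            }
            return Optional.of(expected);
        }
    }

    // Hypothetical stand-in for ReconNodeManager, which delegates heartbeat
    // processing to an embedded node manager.
    static class ReconModel {
        private final NodeManagerModel delegate = new NodeManagerModel();
        private final boolean syncBeforeDelegating;

        ReconModel(boolean syncBeforeDelegating) {
            this.syncBeforeDelegating = syncBeforeDelegating;
        }

        void register(String dn) {
            delegate.register(dn);
        }

        Optional<OpState> processHeartbeat(String dn, OpState reported) {
            if (syncBeforeDelegating) {
                // Proposed fix: adopt the reported state first, so the
                // delegate sees no mismatch and issues no command.
                delegate.setOpState(dn, reported);
            }
            return delegate.processHeartbeat(dn, reported);
        }
    }

    public static void main(String[] args) {
        ReconModel current = new ReconModel(false);
        current.register("datanode_3");
        System.out.println("without sync: "
            + current.processHeartbeat("datanode_3", OpState.DECOMMISSIONED));

        ReconModel fixed = new ReconModel(true);
        fixed.register("datanode_3");
        System.out.println("with sync: "
            + fixed.processHeartbeat("datanode_3", OpState.DECOMMISSIONED));
    }
}
```

In this model the unsynchronised variant answers a DECOMMISSIONED heartbeat with a command to return to IN_SERVICE, which corresponds to the second log line above, while the synchronised variant issues no command. Whether Recon should blindly trust the reported state in every case is a design question this sketch does not address.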