Soumitra Sulav created HDDS-9055:
------------------------------------

             Summary: Datanode decommission Failed, Follower never received the 
command
                 Key: HDDS-9055
                 URL: https://issues.apache.org/jira/browse/HDDS-9055
             Project: Apache Ozone
          Issue Type: Bug
          Components: SCM HA
            Reporter: Soumitra Sulav
             Fix For: 1.4.0


*Issue:*
In one of the Cloudera system tests, 2 Datanodes are scheduled for 
decommission after the data write and data pipeline close.
The LEADER node received the scheduled decommission command, as the test 
expects, but the FOLLOWER never received the decommission command.

*Summary logs:*
Follower
{code:java}
19:58:04,931 : persistedOpState: DECOMMISSIONING, the value stored in SCM 
(IN_SERVICE, 0)
19:58:10,016 : persistedOpState: IN_SERVICE,  the value stored in SCM 
(DECOMMISSIONING, 0)

{code}

Leader: TimeOut
{code:java}
2023-07-20 19:38:31,689 : persistedOpState: IN_SERVICE, the value stored in SCM 
(DECOMMISSIONING, 0)
...... multiple retries .......
2023-07-20 19:55:54,323 : persistedOpState: IN_SERVICE, the value stored in SCM 
(DECOMMISSIONING, 0)
2023-07-20 19:56:24,344 : persistedOpState: IN_SERVICE, the value stored in SCM 
(DECOMMISSIONING, 0)

2023-07-20 19:58:04,931 : persistedOpState: DECOMMISSIONING, the value stored 
in SCM (IN_SERVICE, 0)
{code}

*Detailed logs:*

{code:java}
FOLLOWER

2023-07-20 19:58:04,931 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: 
Update the operationalState saved in follower SCM for 
33c95701-aaa5-4b08-a56b-70ac5d237187{ip: 172.27.12.66, host: 
quasar-zqlpfe-5.quasar-zqlpfe.root.hwx.site, ports: [REPLICATION=9886, 
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], 
networkLocation: /default-rack, certSerialId: 70976812254805668, 
persistedOpState: DECOMMISSIONING, persistedOpStateExpiryEpochSec: 0} as the 
reported value does not match the value stored in SCM (IN_SERVICE, 0)

2023-07-20 19:58:10,016 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: 
Update the operationalState saved in follower SCM for 
33c95701-aaa5-4b08-a56b-70ac5d237187{ip: 172.27.12.66, host: 
quasar-zqlpfe-5.quasar-zqlpfe.root.hwx.site, ports: [REPLICATION=9886, 
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], 
networkLocation: /default-rack, certSerialId: 70976812254805668, 
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0} as the 
reported value does not match the value stored in SCM (DECOMMISSIONING, 0)


LEADER

2023-07-20 19:56:24,344 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: 
Scheduling a command to update the operationalState persisted on 
33c95701-aaa5-4b08-a56b-70ac5d237187{ip: 172.27.12.66, host: 
quasar-zqlpfe-5.quasar-zqlpfe.root.hwx.site, ports: [REPLICATION=9886, 
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], 
networkLocation: /default-rack, certSerialId: 70976812254805668, 
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0} as the 
reported value does not match the value stored in SCM (DECOMMISSIONING, 0)

2023-07-20 19:58:04,931 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: 
Scheduling a command to update the operationalState persisted on 
33c95701-aaa5-4b08-a56b-70ac5d237187{ip: 172.27.12.66, host: 
quasar-zqlpfe-5.quasar-zqlpfe.root.hwx.site, ports: [REPLICATION=9886, 
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], 
networkLocation: /default-rack, certSerialId: 70976812254805668, 
persistedOpState: DECOMMISSIONING, persistedOpStateExpiryEpochSec: 0} as the 
reported value does not match the value stored in SCM (IN_SERVICE, 0)
{code}
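
For readers unfamiliar with the two messages above, here is a minimal sketch of the behaviour the log wording suggests: when the state reported by the Datanode differs from the state stored in SCM, the leader schedules a command to the Datanode, while a follower updates its own stored value. This is inferred from the log text only and is not the actual SCMNodeManager code; the class name, the Action enum and the reconcile method are hypothetical names used for illustration.

```java
public class OpStateReconcileSketch {

    // Operational states that appear in the logs above (subset, for illustration).
    enum OpState { IN_SERVICE, DECOMMISSIONING }

    // Hypothetical outcome names; not taken from the Ozone code base.
    enum Action { NONE, SEND_COMMAND_TO_DATANODE, OVERWRITE_SCM_STATE }

    /**
     * Decide what an SCM does when a Datanode reports an operational state.
     * Assumption drawn from the log messages: the leader treats its own stored
     * value as authoritative and schedules a command to the Datanode, while a
     * follower adopts the value reported by the Datanode.
     */
    static Action reconcile(boolean isLeader, OpState reported, OpState storedInScm) {
        if (reported == storedInScm) {
            return Action.NONE; // nothing to reconcile
        }
        return isLeader ? Action.SEND_COMMAND_TO_DATANODE : Action.OVERWRITE_SCM_STATE;
    }

    public static void main(String[] args) {
        // Leader, 19:56:24: Datanode reports IN_SERVICE, SCM holds DECOMMISSIONING.
        System.out.println(reconcile(true, OpState.IN_SERVICE, OpState.DECOMMISSIONING));
        // Follower, 19:58:04: Datanode reports DECOMMISSIONING, SCM holds IN_SERVICE.
        System.out.println(reconcile(false, OpState.DECOMMISSIONING, OpState.IN_SERVICE));
    }
}
```

Under this reading, the follower only learns about a state change after the Datanode itself reports the new value, which matches the first follower entry appearing at 19:58:04.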

Please find the SCM logs attached for more details.


