Soumitra Sulav created HDDS-9055:
------------------------------------
Summary: Datanode decommission Failed, Follower never received the
command
Key: HDDS-9055
URL: https://issues.apache.org/jira/browse/HDDS-9055
Project: Apache Ozone
Issue Type: Bug
Components: SCM HA
Reporter: Soumitra Sulav
Fix For: 1.4.0
*Issue:*
In one of the Cloudera system tests, 2 Datanodes are scheduled for
decommission after the data write completes and the data pipelines are closed.
The LEADER node received the scheduled decommission command, as the test
expects, but the FOLLOWER never received it.
*Summary logs:*
Follower
{code:java}
19:58:04,931 : persistedOpState: DECOMMISSIONING, the value stored in SCM
(IN_SERVICE, 0)
19:58:10,016 : persistedOpState: IN_SERVICE, the value stored in SCM
(DECOMMISSIONING, 0)
{code}
Leader: TimeOut
{code:java}
2023-07-20 19:38:31,689 : persistedOpState: IN_SERVICE, the value stored in SCM
(DECOMMISSIONING, 0)
...... multiple retries .......
2023-07-20 19:55:54,323 : persistedOpState: IN_SERVICE, the value stored in SCM
(DECOMMISSIONING, 0)
2023-07-20 19:56:24,344 : persistedOpState: IN_SERVICE, the value stored in SCM
(DECOMMISSIONING, 0)
2023-07-20 19:58:04,931 : persistedOpState: DECOMMISSIONING, the value stored
in SCM (IN_SERVICE, 0)
{code}
*Detailed logs:*
{code:java}
FOLLOWER
2023-07-20 19:58:04,931 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager:
Update the operationalState saved in follower SCM for
33c95701-aaa5-4b08-a56b-70ac5d237187{ip: 172.27.12.66, host:
quasar-zqlpfe-5.quasar-zqlpfe.root.hwx.site, ports: [REPLICATION=9886,
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859],
networkLocation: /default-rack, certSerialId: 70976812254805668,
persistedOpState: DECOMMISSIONING, persistedOpStateExpiryEpochSec: 0} as the
reported value does not match the value stored in SCM (IN_SERVICE, 0)
2023-07-20 19:58:10,016 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager:
Update the operationalState saved in follower SCM for
33c95701-aaa5-4b08-a56b-70ac5d237187{ip: 172.27.12.66, host:
quasar-zqlpfe-5.quasar-zqlpfe.root.hwx.site, ports: [REPLICATION=9886,
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859],
networkLocation: /default-rack, certSerialId: 70976812254805668,
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0} as the
reported value does not match the value stored in SCM (DECOMMISSIONING, 0)
LEADER
2023-07-20 19:56:24,344 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager:
Scheduling a command to update the operationalState persisted on
33c95701-aaa5-4b08-a56b-70ac5d237187{ip: 172.27.12.66, host:
quasar-zqlpfe-5.quasar-zqlpfe.root.hwx.site, ports: [REPLICATION=9886,
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859],
networkLocation: /default-rack, certSerialId: 70976812254805668,
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0} as the
reported value does not match the value stored in SCM (DECOMMISSIONING, 0)
2023-07-20 19:58:04,931 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager:
Scheduling a command to update the operationalState persisted on
33c95701-aaa5-4b08-a56b-70ac5d237187{ip: 172.27.12.66, host:
quasar-zqlpfe-5.quasar-zqlpfe.root.hwx.site, ports: [REPLICATION=9886,
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859],
networkLocation: /default-rack, certSerialId: 70976812254805668,
persistedOpState: DECOMMISSIONING, persistedOpStateExpiryEpochSec: 0} as the
reported value does not match the value stored in SCM (IN_SERVICE, 0)
{code}
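To compare the leader and follower entries above side by side, the wrapped SCMNodeManager messages can be reduced to a small timeline of (timestamp, action, reported state, stored state). The following is a minimal sketch that assumes the log text has the same shape as the excerpts in this report. The names `OpStateMismatch` and `parse_mismatches`, and the action labels `schedule-command` and `update-stored`, are hypothetical helpers introduced for illustration and are not part of Ozone.

```python
import re
from typing import List, NamedTuple


class OpStateMismatch(NamedTuple):
    """One operational-state mismatch message (hypothetical helper type)."""
    timestamp: str
    action: str    # illustrative label derived from the message wording
    reported: str  # persistedOpState printed inside the datanode details
    stored: str    # value printed after "stored in SCM ("


# Matches the two message shapes quoted in this report. Anything else in the
# log is ignored. The pattern is an assumption based only on these excerpts.
_ENTRY = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO \S*SCMNodeManager:\s+"
    r"(?P<verb>Scheduling a command to update|Update) the operationalState"
    r".*?persistedOpState: (?P<reported>\w+),"
    r".*?stored in SCM \((?P<stored>\w+),"
)


def parse_mismatches(log_text: str) -> List[OpStateMismatch]:
    """Extract mismatch entries from hard-wrapped SCM log text."""
    # The messages are wrapped across several lines, so collapse all
    # whitespace into single spaces before matching.
    flat = " ".join(log_text.split())
    entries = []
    for match in _ENTRY.finditer(flat):
        # "Update ..." is the wording seen in the FOLLOWER excerpt;
        # "Scheduling a command ..." is the wording seen in the LEADER excerpt.
        action = "update-stored" if match["verb"] == "Update" else "schedule-command"
        entries.append(
            OpStateMismatch(match["ts"], action, match["reported"], match["stored"])
        )
    return entries
```

Running `sorted(parse_mismatches(text))` over the combined leader and follower logs orders the entries by timestamp, which makes it easier to see where the reported and stored states swap.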
Please find the SCM logs attached for more details.