Pratyush Bhatt created HDDS-9608:
------------------------------------
Summary: [MasterNode decommissioning] InvalidStateTransitionException after recommissioning SCM
Key: HDDS-9608
URL: https://issues.apache.org/jira/browse/HDDS-9608
Project: Apache Ozone
Issue Type: Bug
Components: SCM
Reporter: Pratyush Bhatt
*Scenario:* Decommission and Recommission the same SCM node.
*Observation:*
{code:java}
ozone admin scm roles
2023-11-01 04:05:18,948|INFO|MainThread|machine.py:205 -
run()||GUID=0825cc57-3a75-4632-b9e4-0ede9c2a30a6|ozn-decom202-2.ozn-decom202.xyz:1111:LEADER:aadb0a54-a86b-4be2-8fe1-9c61c4b8de3b:172.27.88.4
2023-11-01 04:05:18,949|INFO|MainThread|machine.py:205 -
run()||GUID=0825cc57-3a75-4632-b9e4-0ede9c2a30a6|ozn-decom202-6.ozn-decom202.xyz:1111:FOLLOWER:93bcd687-ddff-448f-b778-636c2f8652a2:172.27.17.130
2023-11-01 04:05:18,949|INFO|MainThread|machine.py:205 -
run()||GUID=0825cc57-3a75-4632-b9e4-0ede9c2a30a6|ozn-decom202-5.ozn-decom202.xyz:1111:FOLLOWER:a1bfdda0-c1b6-453d-91d0-9fdd3eee8041:172.27.204.67
{code}
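To check the role assignment programmatically, the roles output can be parsed as below. This is a minimal sketch, assuming each entry has the form host:port:ROLE:scmId:ip as in the output above, optionally preceded by the harness prefix ending in "|". The names ScmRole and parse_scm_roles are hypothetical and not part of Ozone or the test harness, and IPv6 addresses are not handled.

```python
from typing import List, NamedTuple


class ScmRole(NamedTuple):
    host: str
    port: int
    role: str
    scm_id: str
    ip: str


def parse_scm_roles(lines: List[str]) -> List[ScmRole]:
    """Parse `ozone admin scm roles` entries of the assumed form host:port:ROLE:scmId:ip."""
    roles = []
    for line in lines:
        # Harness log lines carry a "...|GUID=...|" prefix; keep only the text
        # after the last '|'. Raw CLI lines without '|' pass through unchanged.
        payload = line.strip().rsplit("|", 1)[-1]
        parts = payload.split(":")
        if len(parts) != 5 or not parts[1].isdigit():
            continue  # not a role entry (for example, a wrapped timestamp line)
        host, port, role, scm_id, ip = parts
        roles.append(ScmRole(host, int(port), role, scm_id, ip))
    return roles


sample = [
    "2023-11-01 04:05:18,948|INFO|MainThread|machine.py:205 -",
    "run()||GUID=0825cc57-3a75-4632-b9e4-0ede9c2a30a6|ozn-decom202-2.ozn-decom202.xyz:1111:LEADER:aadb0a54-a86b-4be2-8fe1-9c61c4b8de3b:172.27.88.4",
    "run()||GUID=0825cc57-3a75-4632-b9e4-0ede9c2a30a6|ozn-decom202-6.ozn-decom202.xyz:1111:FOLLOWER:93bcd687-ddff-448f-b778-636c2f8652a2:172.27.17.130",
    "run()||GUID=0825cc57-3a75-4632-b9e4-0ede9c2a30a6|ozn-decom202-5.ozn-decom202.xyz:1111:FOLLOWER:a1bfdda0-c1b6-453d-91d0-9fdd3eee8041:172.27.204.67",
]
roles = parse_scm_roles(sample)
leaders = [r.host for r in roles if r.role == "LEADER"]
```

With the sample above, leaders holds the single leader host, which makes it easy to assert that exactly one LEADER exists before and after the decommission.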
The node to decommission was:
{code:java}
ozn-decom202-6.ozn-decom202.xyz (A primordial Node) {code}
ozn-decom202-5.ozn-decom202.xyz was made the new primordial node:
{code:java}
'ozone.scm.primordial.node.id': 'ozn-decom202-5.ozn-decom202.xyz'{code}
All SCM metadata directories on the decommissioned node were deleted:
{code:java}
2023-11-01 04:15:03,829|INFO|MainThread|ozone.py:4297 -
scmDecommissionedNodeCleanup()|All SCM Dirs to delete are:
['/var/lib/hadoop-ozone/scm/data', '/var/lib/hadoop-ozone/scm/ratis',
'/var/lib/hadoop-ozone/scm/ozone-metadata']
2023-11-01 04:15:03,830|INFO|MainThread|machine.py:190 -
run()||GUID=944252c8-9252-410f-90f9-72c3f5163ba5|RUNNING: ssh -l root -i
/tmp/hw-qe-keypair.pem -q -o StrictHostKeyChecking=no -o
UserKnownHostsFile=/dev/null ozn-decom202-6.ozn-decom202.xyz "sudo -u root rm
-rf /var/lib/hadoop-ozone/scm/data"
2023-11-01 04:15:04,072|INFO|MainThread|machine.py:232 -
run()||GUID=944252c8-9252-410f-90f9-72c3f5163ba5|Exit Code: 0
2023-11-01 04:15:04,074|INFO|MainThread|machine.py:190 -
run()||GUID=bee7796f-4069-4634-a949-a9a020a18553|RUNNING: ssh -l root -i
/tmp/hw-qe-keypair.pem -q -o StrictHostKeyChecking=no -o
UserKnownHostsFile=/dev/null ozn-decom202-6.ozn-decom202.xyz "sudo -u root rm
-rf /var/lib/hadoop-ozone/scm/ratis"
2023-11-01 04:15:04,285|INFO|MainThread|machine.py:232 -
run()||GUID=bee7796f-4069-4634-a949-a9a020a18553|Exit Code: 0
2023-11-01 04:15:04,287|INFO|MainThread|machine.py:190 -
run()||GUID=8bf28f92-1b40-4af5-bcbe-dc106d87888a|RUNNING: ssh -l root -i
/tmp/hw-qe-keypair.pem -q -o StrictHostKeyChecking=no -o
UserKnownHostsFile=/dev/null ozn-decom202-6.ozn-decom202.xyz "sudo -u root rm
-rf /var/lib/hadoop-ozone/scm/ozone-metadata" {code}
The node was removed:
{code:java}
2023-11-01 04:15:04,835|Successfully deleted role
OZON1542132b-STORAGE_CONTAINER_MANAGER-68fe6978b07cabd016a5aeed2 from service
OZONE-1 {code}
The same node was then added back and recommissioned:
{code:java}
2023-11-01 04:16:43,229|Created role_name =
OZON1542132b-STORAGE_CONTAINER_MANAGER-68fe6978b07cabd016a5aeed2 for service =
OZONE-1 on host = ozn-decom202-6.ozn-decom202.xyz {code}
SCM bootstrap succeeded according to the SCM logs:
{code:java}
2023-11-01 04:18:52,598 INFO
[main]-org.apache.hadoop.hdds.scm.ha.HASecurityUtils: Successfully stored SCM
signed certificate.
2023-11-01 04:18:52,606 INFO
[main]-org.apache.hadoop.hdds.scm.server.StorageContainerManager: SCM BootStrap
is successful for ClusterID CID-cb40013e-871a-4db6-85d6-d8a88831e5c9, SCMID
fec84ffb-12fe-4339-8707-aebb6641cd1c
2023-11-01 04:18:52,606 INFO
[main]-org.apache.hadoop.hdds.scm.server.StorageContainerManager: Primary SCM
Node ID aadb0a54-a86b-4be2-8fe1-9c61c4b8de3b {code}
But soon after, the SCM shut down with InvalidStateTransitionException: Invalid
event: CLOSE at OPEN state. (Thanks [~sumitagrawal] for the debugging help.)
{code:java}
2023-11-01 04:18:59,966 WARN
[fec84ffb-12fe-4339-8707-aebb6641cd1c@group-D8A88831E5C9-StateMachineUpdater]-org.apache.hadoop.hdds.scm.ha.SequenceIdGenerator:
Failed to allocate a batch for containerId, expected lastId is 0, actual
lastId is 25000.
2023-11-01 04:18:59,971 ERROR
[fec84ffb-12fe-4339-8707-aebb6641cd1c@group-D8A88831E5C9-StateMachineUpdater]-org.apache.ratis.statemachine.StateMachine:
Terminating with exit status 1: Invalid event: CLOSE at OPEN state.
org.apache.hadoop.ozone.common.statemachine.InvalidStateTransitionException:
Invalid event: CLOSE at OPEN state.
at
org.apache.hadoop.ozone.common.statemachine.StateMachine.getNextState(StateMachine.java:60)
at
org.apache.hadoop.hdds.scm.container.ContainerStateManagerImpl.updateContainerState(ContainerStateManagerImpl.java:356)
at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.process(SCMStateMachine.java:188)
at
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.applyTransaction(SCMStateMachine.java:148)
at
org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1777)
at
org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:242)
at
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:184)
at java.lang.Thread.run(Thread.java:748)
2023-11-01 04:18:59,975 INFO
[shutdown-hook-0]-org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter:
SHUTDOWN_MSG: {code}
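For context on the exception message, the container lifecycle can be modeled as a lookup table keyed by (state, event). The sketch below is a simplified Python model, not Ozone's Java implementation. The transition table is a partial one based on my understanding of the SCM container lifecycle (OPEN moves to CLOSING on FINALIZE, and CLOSE is only accepted from CLOSING) and should be checked against ContainerStateManagerImpl. The names TRANSITIONS, next_state and InvalidStateTransitionError are hypothetical.

```python
class InvalidStateTransitionError(Exception):
    """Raised when an event is not allowed in the current state."""


# Assumed, partial transition table: (current state, event) -> next state.
TRANSITIONS = {
    ("OPEN", "FINALIZE"): "CLOSING",
    ("CLOSING", "QUASI_CLOSE"): "QUASI_CLOSED",
    ("CLOSING", "CLOSE"): "CLOSED",
    ("QUASI_CLOSED", "FORCE_CLOSE"): "CLOSED",
}


def next_state(state: str, event: str) -> str:
    """Return the next state, or raise if the event is invalid for this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise InvalidStateTransitionError(
            f"Invalid event: {event} at {state} state."
        ) from None


# A container that went through FINALIZE accepts CLOSE.
closed = next_state(next_state("OPEN", "FINALIZE"), "CLOSE")

# A container still recorded as OPEN rejects CLOSE, as in the stack trace above.
try:
    next_state("OPEN", "CLOSE")
    error_message = None
except InvalidStateTransitionError as exc:
    error_message = str(exc)
```

Under this model, the failure means the recommissioned SCM applied a CLOSE transaction to a container it still held in the OPEN state. Why its local state was behind is the open question in this report.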