[ https://issues.apache.org/jira/browse/HDDS-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pratyush Bhatt updated HDDS-9608:
---------------------------------
    Description: 
*Scenario:* Decommission and recommission the same SCM node.

*Observation:*
{code:java}
ozone admin scm roles
2023-11-01 04:05:18,948|INFO|MainThread|machine.py:205 - run()||GUID=0825cc57-3a75-4632-b9e4-0ede9c2a30a6|ozn-decom202-2.ozn-decom202.xyz:1111:LEADER:aadb0a54-a86b-4be2-8fe1-9c61c4b8de3b:172.27.88.4
2023-11-01 04:05:18,949|INFO|MainThread|machine.py:205 - run()||GUID=0825cc57-3a75-4632-b9e4-0ede9c2a30a6|ozn-decom202-6.ozn-decom202.xyz:1111:FOLLOWER:93bcd687-ddff-448f-b778-636c2f8652a2:172.27.17.130
2023-11-01 04:05:18,949|INFO|MainThread|machine.py:205 - run()||GUID=0825cc57-3a75-4632-b9e4-0ede9c2a30a6|ozn-decom202-5.ozn-decom202.xyz:1111:FOLLOWER:a1bfdda0-c1b6-453d-91d0-9fdd3eee8041:172.27.204.67
 {code}
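Each entry in the roles output above appears to follow the pattern host:port:ROLE:nodeId:ip. As a minimal sketch under that assumption, the entries could be split into fields as follows. The class name {{ScmRoleEntrySketch}} and its fields are hypothetical and are not part of the Ozone code base. The input is assumed to be only the part of the log line after the last {{|}}, and an IPv6 address would need different handling.

```java
// Hypothetical helper for illustration only; not part of Apache Ozone.
final class ScmRoleEntrySketch {
    final String host;
    final int port;
    final String role;
    final String nodeId;
    final String ip;

    private ScmRoleEntrySketch(String host, int port, String role,
                               String nodeId, String ip) {
        this.host = host;
        this.port = port;
        this.role = role;
        this.nodeId = nodeId;
        this.ip = ip;
    }

    // Assumes exactly five colon-separated fields: host:port:ROLE:nodeId:ip.
    static ScmRoleEntrySketch parse(String entry) {
        String[] fields = entry.trim().split(":");
        if (fields.length != 5) {
            throw new IllegalArgumentException(
                "Expected 5 colon-separated fields: " + entry);
        }
        return new ScmRoleEntrySketch(fields[0], Integer.parseInt(fields[1]),
            fields[2], fields[3], fields[4]);
    }

    public static void main(String[] args) {
        // Entries copied from the roles output in this report.
        String[] entries = {
            "ozn-decom202-2.ozn-decom202.xyz:1111:LEADER:aadb0a54-a86b-4be2-8fe1-9c61c4b8de3b:172.27.88.4",
            "ozn-decom202-6.ozn-decom202.xyz:1111:FOLLOWER:93bcd687-ddff-448f-b778-636c2f8652a2:172.27.17.130",
            "ozn-decom202-5.ozn-decom202.xyz:1111:FOLLOWER:a1bfdda0-c1b6-453d-91d0-9fdd3eee8041:172.27.204.67"
        };
        for (String entry : entries) {
            ScmRoleEntrySketch parsed = parse(entry);
            System.out.println(parsed.role + " " + parsed.host
                + " (" + parsed.nodeId + ")");
        }
    }
}
```

Read this way, the leader's node ID (aadb0a54-a86b-4be2-8fe1-9c61c4b8de3b) is the same ID that the bootstrap log later reports as the Primary SCM Node ID.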
The node to decommission was the primordial node:
{code:java}
ozn-decom202-6.ozn-decom202.xyz {code}
ozn-decom202-5.ozn-decom202.xyz was then made the new primordial node:
{code:java}
'ozone.scm.primordial.node.id': 'ozn-decom202-5.ozn-decom202.xyz'{code}
All SCM metadata directories were deleted on the decommissioned node:
{code:java}
2023-11-01 04:15:03,829|INFO|MainThread|sudo -u root rm -rf /var/lib/hadoop-ozone/scm/data
2023-11-01 04:15:04,072|INFO|MainThread|sudo -u root rm -rf /var/lib/hadoop-ozone/scm/ratis
2023-11-01 04:15:04,285|INFO|MainThread|sudo -u root rm -rf /var/lib/hadoop-ozone/scm/ozone-metadata{code}
The node was then removed:
{code:java}
2023-11-01 04:15:04,835|Successfully deleted role OZON1542132b-STORAGE_CONTAINER_MANAGER-68fe6978b07cabd016a5aeed2 from service OZONE-1 {code}
The same node was added back and recommissioned:
{code:java}
2023-11-01 04:16:43,229|Created role_name = OZON1542132b-STORAGE_CONTAINER_MANAGER-68fe6978b07cabd016a5aeed2 for service = OZONE-1 on host = ozn-decom202-6.ozn-decom202.xyz {code}
SCM bootstrap succeeded according to the SCM logs:
{code:java}
2023-11-01 04:18:52,598 INFO [main]-org.apache.hadoop.hdds.scm.ha.HASecurityUtils: Successfully stored SCM signed certificate.
2023-11-01 04:18:52,606 INFO [main]-org.apache.hadoop.hdds.scm.server.StorageContainerManager: SCM BootStrap  is successful for ClusterID CID-cb40013e-871a-4db6-85d6-d8a88831e5c9, SCMID fec84ffb-12fe-4339-8707-aebb6641cd1c
2023-11-01 04:18:52,606 INFO [main]-org.apache.hadoop.hdds.scm.server.StorageContainerManager: Primary SCM Node ID aadb0a54-a86b-4be2-8fe1-9c61c4b8de3b {code}
Soon afterwards, however, the SCM shut down with InvalidStateTransitionException: Invalid event: CLOSE at OPEN state. (Thanks to [~sumitagrawal] for the debugging help.)
{code:java}
2023-11-01 04:18:59,966 WARN [fec84ffb-12fe-4339-8707-aebb6641cd1c@group-D8A88831E5C9-StateMachineUpdater]-org.apache.hadoop.hdds.scm.ha.SequenceIdGenerator: Failed to allocate a batch for containerId, expected lastId is 0, actual lastId is 25000.
2023-11-01 04:18:59,971 ERROR [fec84ffb-12fe-4339-8707-aebb6641cd1c@group-D8A88831E5C9-StateMachineUpdater]-org.apache.ratis.statemachine.StateMachine: Terminating with exit status 1: Invalid event: CLOSE at OPEN state.
org.apache.hadoop.ozone.common.statemachine.InvalidStateTransitionException: Invalid event: CLOSE at OPEN state.
        at org.apache.hadoop.ozone.common.statemachine.StateMachine.getNextState(StateMachine.java:60)
        at org.apache.hadoop.hdds.scm.container.ContainerStateManagerImpl.updateContainerState(ContainerStateManagerImpl.java:356)
        at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.process(SCMStateMachine.java:188)
        at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.applyTransaction(SCMStateMachine.java:148)
        at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1777)
        at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:242)
        at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:184)
        at java.lang.Thread.run(Thread.java:748)
2023-11-01 04:18:59,975 INFO 
[shutdown-hook-0]-org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter:
 SHUTDOWN_MSG: {code}
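To illustrate what the exception above means, here is a minimal sketch of a table-driven lifecycle state machine that rejects an event with no transition from the current state. It is a simplified, hypothetical model and not the Ozone implementation: the names {{ContainerLifecycleSketch}}, {{State}}, {{Event}} and {{nextState}} are invented here, and the two transitions (OPEN to CLOSING on FINALIZE, CLOSING to CLOSED on CLOSE) are an assumed subset of the container lifecycle. It does not claim to identify the root cause of this issue; it only shows how a CLOSE applied to a container that the local state still considers OPEN would produce this message.

```java
import java.util.EnumMap;
import java.util.Map;

// Simplified, hypothetical model for illustration; not the Ozone code.
final class ContainerLifecycleSketch {
    enum State { OPEN, CLOSING, CLOSED }
    enum Event { FINALIZE, CLOSE }

    // Unchecked stand-in for InvalidStateTransitionException.
    static final class InvalidTransitionException extends RuntimeException {
        InvalidTransitionException(State state, Event event) {
            super("Invalid event: " + event + " at " + state + " state.");
        }
    }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TRANSITIONS =
        new EnumMap<>(State.class);

    static {
        // Assumed subset of the lifecycle: OPEN -> CLOSING -> CLOSED.
        addTransition(State.OPEN, Event.FINALIZE, State.CLOSING);
        addTransition(State.CLOSING, Event.CLOSE, State.CLOSED);
    }

    private static void addTransition(State from, Event event, State to) {
        TRANSITIONS.computeIfAbsent(from, s -> new EnumMap<>(Event.class))
            .put(event, to);
    }

    // Returns the next state, or throws if no transition is registered.
    static State nextState(State from, Event event) {
        Map<Event, State> row = TRANSITIONS.get(from);
        State to = (row == null) ? null : row.get(event);
        if (to == null) {
            throw new InvalidTransitionException(from, event);
        }
        return to;
    }

    public static void main(String[] args) {
        // Normal sequence: FINALIZE first, then CLOSE.
        State state = nextState(State.OPEN, Event.FINALIZE);
        state = nextState(state, Event.CLOSE);
        System.out.println("Normal sequence ends in " + state);

        // CLOSE applied while the local view of the container is still OPEN.
        try {
            nextState(State.OPEN, Event.CLOSE);
        } catch (InvalidTransitionException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the error is caught and printed. In the log above, the same kind of failure is raised while the Ratis StateMachineUpdater applies a transaction, and the process terminates with exit status 1.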

  was:
*Scenario:* Decommission and Recommission the same SCM node.

*Observation:*
{code:java}
ozone admin scm roles
2023-11-01 04:05:18,948|INFO|MainThread|machine.py:205 - 
run()||GUID=0825cc57-3a75-4632-b9e4-0ede9c2a30a6|ozn-decom202-2.ozn-decom202.xyz:1111:LEADER:aadb0a54-a86b-4be2-8fe1-9c61c4b8de3b:172.27.88.4
2023-11-01 04:05:18,949|INFO|MainThread|machine.py:205 - 
run()||GUID=0825cc57-3a75-4632-b9e4-0ede9c2a30a6|ozn-decom202-6.ozn-decom202.xyz:1111:FOLLOWER:93bcd687-ddff-448f-b778-636c2f8652a2:172.27.17.130
2023-11-01 04:05:18,949|INFO|MainThread|machine.py:205 - 
run()||GUID=0825cc57-3a75-4632-b9e4-0ede9c2a30a6|ozn-decom202-5.ozn-decom202.xyz:1111:FOLLOWER:a1bfdda0-c1b6-453d-91d0-9fdd3eee8041:172.27.204.67
 {code}
Node to decommission was: 
{code:java}
ozn-decom202-6.ozn-decom202.xyz (A primordial Node) {code}
ozn-decom202-5.ozn-decom202.xyz was made the new primordial node
{code:java}
'ozone.scm.primordial.node.id': 'ozn-decom202-5.ozn-decom202.xyz'{code}
All metadirs were deleted:
{code:java}
2023-11-01 04:15:03,829|INFO|MainThread|ozone.py:4297 - 
scmDecommissionedNodeCleanup()|All SCM Dirs to delete are: 
['/var/lib/hadoop-ozone/scm/data', '/var/lib/hadoop-ozone/scm/ratis', 
'/var/lib/hadoop-ozone/scm/ozone-metadata']
2023-11-01 04:15:03,830|INFO|MainThread|machine.py:190 - 
run()||GUID=944252c8-9252-410f-90f9-72c3f5163ba5|RUNNING: ssh -l root -i 
/tmp/hw-qe-keypair.pem -q -o StrictHostKeyChecking=no -o 
UserKnownHostsFile=/dev/null ozn-decom202-6.ozn-decom202.xyz "sudo -u root rm 
-rf /var/lib/hadoop-ozone/scm/data"
2023-11-01 04:15:04,072|INFO|MainThread|machine.py:232 - 
run()||GUID=944252c8-9252-410f-90f9-72c3f5163ba5|Exit Code: 0
2023-11-01 04:15:04,074|INFO|MainThread|machine.py:190 - 
run()||GUID=bee7796f-4069-4634-a949-a9a020a18553|RUNNING: ssh -l root -i 
/tmp/hw-qe-keypair.pem -q -o StrictHostKeyChecking=no -o 
UserKnownHostsFile=/dev/null ozn-decom202-6.ozn-decom202.xyz "sudo -u root rm 
-rf /var/lib/hadoop-ozone/scm/ratis"
2023-11-01 04:15:04,285|INFO|MainThread|machine.py:232 - 
run()||GUID=bee7796f-4069-4634-a949-a9a020a18553|Exit Code: 0
2023-11-01 04:15:04,287|INFO|MainThread|machine.py:190 - 
run()||GUID=8bf28f92-1b40-4af5-bcbe-dc106d87888a|RUNNING: ssh -l root -i 
/tmp/hw-qe-keypair.pem -q -o StrictHostKeyChecking=no -o 
UserKnownHostsFile=/dev/null ozn-decom202-6.ozn-decom202.xyz "sudo -u root rm 
-rf /var/lib/hadoop-ozone/scm/ozone-metadata" {code}
Node was removed:
{code:java}
2023-11-01 04:15:04,835|Successfully deleted role 
OZON1542132b-STORAGE_CONTAINER_MANAGER-68fe6978b07cabd016a5aeed2 from service 
OZONE-1 {code}
Same node was added back and was recommissioned:
{code:java}
2023-11-01 04:16:43,229|Created role_name = 
OZON1542132b-STORAGE_CONTAINER_MANAGER-68fe6978b07cabd016a5aeed2 for service = 
OZONE-1 on host = ozn-decom202-6.ozn-decom202.xyz {code}
SCM Bootstrap was successful as per SCM logs:
{code:java}
2023-11-01 04:18:52,598 INFO 
[main]-org.apache.hadoop.hdds.scm.ha.HASecurityUtils: Successfully stored SCM 
signed certificate.
2023-11-01 04:18:52,606 INFO 
[main]-org.apache.hadoop.hdds.scm.server.StorageContainerManager: SCM BootStrap 
 is successful for ClusterID CID-cb40013e-871a-4db6-85d6-d8a88831e5c9, SCMID 
fec84ffb-12fe-4339-8707-aebb6641cd1c
2023-11-01 04:18:52,606 INFO 
[main]-org.apache.hadoop.hdds.scm.server.StorageContainerManager: Primary SCM 
Node ID aadb0a54-a86b-4be2-8fe1-9c61c4b8de3b {code}
But soon after, SCM shuts down with InvalidStateTransitionException: Invalid 
event: CLOSE at OPEN state. (Thanks [~sumitagrawal] for debugging help)
{code:java}
2023-11-01 04:18:59,966 WARN 
[fec84ffb-12fe-4339-8707-aebb6641cd1c@group-D8A88831E5C9-StateMachineUpdater]-org.apache.hadoop.hdds.scm.ha.SequenceIdGenerator:
 Failed to allocate a batch for containerId, expected lastId is 0, actual 
lastId is 25000.
2023-11-01 04:18:59,971 ERROR 
[fec84ffb-12fe-4339-8707-aebb6641cd1c@group-D8A88831E5C9-StateMachineUpdater]-org.apache.ratis.statemachine.StateMachine:
 Terminating with exit status 1: Invalid event: CLOSE at OPEN state.
org.apache.hadoop.ozone.common.statemachine.InvalidStateTransitionException: 
Invalid event: CLOSE at OPEN state.
        at 
org.apache.hadoop.ozone.common.statemachine.StateMachine.getNextState(StateMachine.java:60)
        at 
org.apache.hadoop.hdds.scm.container.ContainerStateManagerImpl.updateContainerState(ContainerStateManagerImpl.java:356)
        at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.process(SCMStateMachine.java:188)
        at 
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.applyTransaction(SCMStateMachine.java:148)
        at 
org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1777)
        at 
org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:242)
        at 
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:184)
        at java.lang.Thread.run(Thread.java:748)
2023-11-01 04:18:59,975 INFO 
[shutdown-hook-0]-org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter:
 SHUTDOWN_MSG: {code}


> [MasterNode decommissioning] InvalidStateTransitionException after recommissioning SCM
> --------------------------------------------------------------------------------------
>
>                 Key: HDDS-9608
>                 URL: https://issues.apache.org/jira/browse/HDDS-9608
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>            Reporter: Pratyush Bhatt
>            Priority: Major
>


