Wei-Chiu Chuang created HDDS-11785:
--------------------------------------

             Summary: DataNode aborts state machine because 
ContainerStateMachine is unable to remove state machine data
                 Key: HDDS-11785
                 URL: https://issues.apache.org/jira/browse/HDDS-11785
             Project: Apache Ozone
          Issue Type: Bug
            Reporter: Wei-Chiu Chuang


We have a DataNode that hit an exception while removing state machine data.
The state machine was then closed, leaving the DataNode with no available
pipelines, and it became idle.

Eventually, SCM could not find any healthy DataNode or pipeline, so it could
not leave safe mode after a restart.

 

cc: [~szetszwo] It seems this can only happen when
hdds.datanode.wait.on.all.followers = true.
{noformat}
2024-11-20 15:41:16,232 ERROR [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.impl.StateMachineUpdater: 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater caught a Throwable.
java.lang.RuntimeException: java.util.NoSuchElementException: No value present
        at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.removeStateMachineDataIfNeeded(ContainerStateMachine.java:880)
        at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.notifyTermIndexUpdated(ContainerStateMachine.java:847)
        at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1755)
        at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:242)
        at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:184)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.NoSuchElementException: No value present
        at java.util.OptionalLong.getAsLong(OptionalLong.java:118)
        at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.removeStateMachineDataIfNeeded(ContainerStateMachine.java:874)
        ... 5 more
2024-11-20 15:41:16,233 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.RaftServer$Division: 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D: shutdown
2024-11-20 15:41:16,233 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.util.JmxRegister: Successfully un-registered JMX Bean with object name Ratis:service=RaftServer,group=group-B6AD8655BA5D,id=09eca63b-ce87-43a5-ae29-a373c6c8791e
2024-11-20 15:41:16,233 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.impl.RoleInfo: 09eca63b-ce87-43a5-ae29-a373c6c8791e: shutdown 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-LeaderStateImpl
2024-11-20 15:41:16,237 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.impl.PendingRequests: 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-PendingRequests: sendNotLeaderResponses
2024-11-20 15:41:16,239 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.RaftServer$Division: 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D: closes. applyIndex: 188
2024-11-20 15:41:17,232 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-SegmentedRaftLogWorker close()
{noformat}
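
The "Caused by" frame in the trace above is OptionalLong.getAsLong() throwing
"No value present" inside removeStateMachineDataIfNeeded. The following is a
minimal sketch of that failure mode, not the actual ContainerStateMachine
code. It assumes, without confirmation from the source, that the method takes
a minimum over a set of follower indexes and that the set can be empty. The
class MinIndexSketch, its methods, and the sample index values are all
hypothetical names introduced for illustration.

```java
import java.util.NoSuchElementException;
import java.util.OptionalLong;
import java.util.stream.LongStream;

// Hypothetical illustration only; none of these names exist in Ozone or Ratis.
public class MinIndexSketch {

    // Assumption: the index of interest is the minimum over follower indexes.
    // LongStream.min() returns an empty OptionalLong when the array is empty.
    static OptionalLong minFollowerIndex(long[] followerIndexes) {
        return LongStream.of(followerIndexes).min();
    }

    // Unguarded read: getAsLong() throws NoSuchElementException on an empty
    // OptionalLong, which is the exception type seen in the stack trace.
    static long unguarded(long[] followerIndexes) {
        return minFollowerIndex(followerIndexes).getAsLong();
    }

    // Guarded read: fall back to a caller-supplied index instead of throwing.
    static long guarded(long[] followerIndexes, long fallbackIndex) {
        return minFollowerIndex(followerIndexes).orElse(fallbackIndex);
    }

    public static void main(String[] args) {
        long[] noFollowers = {};
        try {
            unguarded(noFollowers);
        } catch (NoSuchElementException e) {
            System.out.println("unguarded threw: " + e.getMessage());
        }
        System.out.println("guarded returned: " + guarded(noFollowers, 100L));
    }
}
```

Whether a fallback index is the right behavior for the DataNode, as opposed to
skipping the removal for that update, is a design question this sketch does
not answer. It only shows that an unguarded getAsLong() turns an empty result
into an exception, which here propagated to the StateMachineUpdater.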



