Wei-Chiu Chuang created HDDS-11785:
--------------------------------------
Summary: DataNode aborts state machine because ContainerStateMachine is unable to remove state machine data
Key: HDDS-11785
URL: https://issues.apache.org/jira/browse/HDDS-11785
Project: Apache Ozone
Issue Type: Bug
Reporter: Wei-Chiu Chuang
We have a DataNode that encountered an exception while removing state machine data.
After that, the Ratis state machine was shut down, the DataNode had no available
pipelines, and it became idle.
Eventually, SCM couldn't find any healthy DataNode or pipeline and couldn't
get out of safe mode after a restart.
cc: [~szetszwo] It seems this can only happen if
hdds.datanode.wait.on.all.followers = true.
{noformat}
2024-11-20 15:41:16,232 ERROR [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.impl.StateMachineUpdater: 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater caught a Throwable.
java.lang.RuntimeException: java.util.NoSuchElementException: No value present
    at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.removeStateMachineDataIfNeeded(ContainerStateMachine.java:880)
    at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.notifyTermIndexUpdated(ContainerStateMachine.java:847)
    at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1755)
    at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:242)
    at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:184)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.NoSuchElementException: No value present
    at java.util.OptionalLong.getAsLong(OptionalLong.java:118)
    at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.removeStateMachineDataIfNeeded(ContainerStateMachine.java:874)
    ... 5 more
2024-11-20 15:41:16,233 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.RaftServer$Division: 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D: shutdown
2024-11-20 15:41:16,233 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.util.JmxRegister: Successfully un-registered JMX Bean with object name Ratis:service=RaftServer,group=group-B6AD8655BA5D,id=09eca63b-ce87-43a5-ae29-a373c6c8791e
2024-11-20 15:41:16,233 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.impl.RoleInfo: 09eca63b-ce87-43a5-ae29-a373c6c8791e: shutdown 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-LeaderStateImpl
2024-11-20 15:41:16,237 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.impl.PendingRequests: 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-PendingRequests: sendNotLeaderResponses
2024-11-20 15:41:16,239 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.RaftServer$Division: 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D: closes. applyIndex: 188
2024-11-20 15:41:17,232 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-SegmentedRaftLogWorker close()
{noformat}
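
For reference, the "No value present" message in the trace is what OptionalLong.getAsLong() throws when the OptionalLong is empty. Below is a minimal sketch of that failure mode and one guarded alternative. It assumes, without confirmation from the trace, that the empty value comes from taking the minimum over an empty set of follower indices. The class name, method names, and the followerNextIndices parameter are hypothetical and are not taken from ContainerStateMachine.

```java
import java.util.Arrays;
import java.util.NoSuchElementException;
import java.util.OptionalLong;

// Hypothetical illustration only; this is not the ContainerStateMachine code.
final class FollowerIndexSketch {

  // Unguarded: min() over an empty array yields an empty OptionalLong,
  // and getAsLong() on it throws NoSuchElementException.
  static long unguardedMin(long[] followerNextIndices) {
    return Arrays.stream(followerNextIndices).min().getAsLong();
  }

  // Guarded: return the OptionalLong itself so the caller must handle
  // the "no value" case instead of letting an exception escape.
  static OptionalLong guardedMin(long[] followerNextIndices) {
    return Arrays.stream(followerNextIndices).min();
  }

  public static void main(String[] args) {
    long[] none = new long[0]; // assumed scenario: no follower indices available

    try {
      unguardedMin(none);
    } catch (NoSuchElementException e) {
      System.out.println("unguarded call failed: " + e.getMessage());
    }

    OptionalLong min = guardedMin(none);
    if (min.isPresent()) {
      System.out.println("min follower index: " + min.getAsLong());
    } else {
      System.out.println("no follower index available, skipping cleanup");
    }
  }
}
```

Whether skipping the cleanup is the right behaviour for the empty case is a design question for the actual fix. The sketch only shows that the empty case can be handled without throwing out of the StateMachineUpdater thread.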