[ 
https://issues.apache.org/jira/browse/HDDS-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated HDDS-11785:
-------------------------------
    Target Version/s: 2.0.0, 1.4.2

> DataNode aborts state machine because ContainerStateMachine does not know 
> follower's next index
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDDS-11785
>                 URL: https://issues.apache.org/jira/browse/HDDS-11785
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>
> We have a DataNode that hit an exception while removing state machine data. 
> After that, the state machine was closed, leaving the DataNode with no 
> available pipelines, so it became idle.
>  
> Eventually, SCM could not find any healthy DataNode or pipeline and could not 
> leave safe mode after a restart.
>  
> cc: [~szetszwo] It seems this can only happen if 
> hdds.datanode.wait.on.all.followers = true.
> {noformat}
> 2024-11-20 15:41:16,232 ERROR [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.impl.StateMachineUpdater: 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater caught a Throwable.
> java.lang.RuntimeException: java.util.NoSuchElementException: No value present
>         at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.removeStateMachineDataIfNeeded(ContainerStateMachine.java:880)
>         at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.notifyTermIndexUpdated(ContainerStateMachine.java:847)
>         at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1755)
>         at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:242)
>         at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:184)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.NoSuchElementException: No value present
>         at java.util.OptionalLong.getAsLong(OptionalLong.java:118)
>         at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.removeStateMachineDataIfNeeded(ContainerStateMachine.java:874)
>         ... 5 more
> 2024-11-20 15:41:16,233 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.RaftServer$Division: 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D: shutdown
> 2024-11-20 15:41:16,233 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.util.JmxRegister: Successfully un-registered JMX Bean with object name Ratis:service=RaftServer,group=group-B6AD8655BA5D,id=09eca63b-ce87-43a5-ae29-a373c6c8791e
> 2024-11-20 15:41:16,233 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.impl.RoleInfo: 09eca63b-ce87-43a5-ae29-a373c6c8791e: shutdown 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-LeaderStateImpl
> 2024-11-20 15:41:16,237 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.impl.PendingRequests: 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-PendingRequests: sendNotLeaderResponses
> 2024-11-20 15:41:16,239 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.RaftServer$Division: 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D: closes. applyIndex: 188
> 2024-11-20 15:41:17,232 INFO [09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-StateMachineUpdater]-org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 09eca63b-ce87-43a5-ae29-a373c6c8791e@group-B6AD8655BA5D-SegmentedRaftLogWorker close() {noformat}
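
The "Caused by" frame in the trace above is OptionalLong.getAsLong() being called on an empty value. The sketch below reproduces that failure mode in isolation and shows one guarded alternative. It is not the Ozone code and not the fix that was merged for this issue: the class name, the method names, the array of follower indices, and the fallback value are all hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.NoSuchElementException;
import java.util.OptionalLong;

// Hypothetical stand-in for the logic around follower next indices.
public class FollowerIndexSketch {

    // Smallest known follower next index; empty when no index is known.
    static OptionalLong minFollowerNextIndex(long[] followerNextIndices) {
        return Arrays.stream(followerNextIndices).min();
    }

    // Unguarded: throws NoSuchElementException when no index is known,
    // which is the exception type seen in the stack trace.
    static long unguarded(long[] followerNextIndices) {
        return minFollowerNextIndex(followerNextIndices).getAsLong();
    }

    // Guarded: returns a caller-supplied fallback instead of throwing.
    // What the right fallback is depends on the caller (assumption).
    static long guarded(long[] followerNextIndices, long fallback) {
        return minFollowerNextIndex(followerNextIndices).orElse(fallback);
    }

    public static void main(String[] args) {
        long[] none = {};
        try {
            unguarded(none);
        } catch (NoSuchElementException e) {
            System.out.println("unguarded threw: " + e);
        }
        System.out.println("guarded, no followers known: " + guarded(none, -1L));
        System.out.println("guarded, two followers known: "
            + guarded(new long[] {190L, 188L}, -1L));
    }
}
```

Whether a fallback, an early return, or skipping the cleanup step is appropriate in ContainerStateMachine is a design decision this report does not settle. The sketch only shows that the empty case has to be handled explicitly so the exception does not reach the StateMachineUpdater thread.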


