[
https://issues.apache.org/jira/browse/HDDS-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18041979#comment-18041979
]
Ethan Rose commented on HDDS-13842:
-----------------------------------
If this incorporates HDDS-13980 and HDDS-13981 then it is expected to contain a
fix for those issues as well and they can be resolved. Otherwise, a different
link type should be used.
> Follower SCM does not come out of safemode
> ------------------------------------------
>
> Key: HDDS-13842
> URL: https://issues.apache.org/jira/browse/HDDS-13842
> Project: Apache Ozone
> Issue Type: Bug
> Components: SCM
> Affects Versions: 2.1.0
> Reporter: Sumit Agrawal
> Assignee: Sumit Agrawal
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.2.0
>
>
> A follower SCM never comes out of safe mode because the RPC server for DN
> heartbeats (HB) is not started on the SCM. It is started only when certain
> actions occur, such as a leader change or the leader making an update to SCM.
>
> scm3: follower on startup
>
> {code:java}
> 2025-10-15 17:39:57,265 INFO
> [main]-org.apache.hadoop.hdds.scm.node.SCMNodeManager: Entering startup safe
> mode.
> <-- SCM RPC server is not started and hence does not receive DN HB.
> 2025-10-15 17:41:10,344 INFO ScmDatanodeProtocol RPC server for DataNodes
> <-- RPC server started, per the log line above
> 2025-10-15 17:42:29,287 INFO
> [node3-EventQueue-ContainerRegistrationReportForRatisContainerSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM exiting safe mode.
> {code}
>
>
> Another observation, where no action was performed for about 11 min:
>
> {code:java}
> 2025-10-25 19:19:34,473 INFO .. Entering startup safe mode.
> <-- 11 min delay: the test case was waiting for the follower to exit safemode
> and no action occurred
> 2025-10-25 19:30:47,921 INFO .. ScmDatanodeProtocol RPC server for DataNodes
> {code}
>
>
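
The gaps quoted in the two excerpts can be checked directly from the log
timestamps. The sketch below is only an illustration: the class name
SafeModeLogGaps and the helper gap() are made up for this example and are not
part of Ozone. It parses the timestamps in the layout shown in the logs and
computes the time between "Entering startup safe mode" and the datanode RPC
server start.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class SafeModeLogGaps {
    // Timestamp layout used in the log excerpts (comma before the milliseconds).
    private static final DateTimeFormatter LOG_TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Returns the elapsed time between two log timestamps. */
    static Duration gap(String from, String to) {
        return Duration.between(
            LocalDateTime.parse(from, LOG_TS),
            LocalDateTime.parse(to, LOG_TS));
    }

    public static void main(String[] args) {
        // scm3: safe mode entered -> datanode RPC server started
        Duration first = gap("2025-10-15 17:39:57,265", "2025-10-15 17:41:10,344");
        // Second observation: same two events, with no action in between
        Duration second = gap("2025-10-25 19:19:34,473", "2025-10-25 19:30:47,921");
        System.out.println("first observation:  " + first.toMillis() + " ms");
        System.out.println("second observation: " + second.toMillis() + " ms");
    }
}
```

This gives roughly 1 min 13 s for the first excerpt and 11 min 13 s for the
second, matching the "11 min delay" annotation.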
> This is because the server is started only on the notifyTermIndexUpdated()
> call from Ratis. The call flow is below.
> {code:java}
> 2025-10-17 14:10:46,937 ERROR
> [b6e60709-ec61-4360-8fb3-65b2317949c0@group-29860CDEEB45-StateMachineUpdater]-org.apache.hadoop.hdds.scm.server.SCMDatanodeProtocolServer:
> heyho
> java.lang.Exception
> at
> org.apache.hadoop.hdds.scm.server.SCMDatanodeProtocolServer.start(SCMDatanodeProtocolServer.java:199)
> at
> org.apache.hadoop.hdds.scm.ha.SCMStateMachine.notifyTermIndexUpdated(SCMStateMachine.java:364)
> at
> org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1848)
> at
> org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:252)
> at
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:193)
> at java.lang.Thread.run(Thread.java:748) {code}
>
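
To make the call flow above concrete, here is a toy model of a server that is
started only from the log-apply callback. It is a sketch under assumptions:
DatanodeRpcServer and StateMachine are hypothetical stand-ins, not the real
SCMDatanodeProtocolServer or SCMStateMachine, and only the method name
notifyTermIndexUpdated is taken from the stack trace.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class LazyStartSketch {
    /** Hypothetical stand-in for the datanode-facing RPC server. */
    static final class DatanodeRpcServer {
        private final AtomicBoolean started = new AtomicBoolean(false);

        /** Idempotent start: returns true only for the call that starts it. */
        boolean start() {
            return started.compareAndSet(false, true);
        }

        boolean isStarted() {
            return started.get();
        }
    }

    /** Hypothetical stand-in for the state machine driven by Ratis. */
    static final class StateMachine {
        private final DatanodeRpcServer server;

        StateMachine(DatanodeRpcServer server) {
            this.server = server;
        }

        // As in the stack trace, the server start hangs off the callback
        // that fires when a log entry is applied.
        void notifyTermIndexUpdated(long term, long index) {
            server.start();
        }
    }

    public static void main(String[] args) {
        DatanodeRpcServer server = new DatanodeRpcServer();
        StateMachine stateMachine = new StateMachine(server);

        // Follower has started, but nothing has been applied yet.
        System.out.println("started before any apply: " + server.isStarted());

        // A leader change or leader update eventually produces an applied entry.
        stateMachine.notifyTermIndexUpdated(1L, 1L);
        System.out.println("started after first apply: " + server.isStarted());
    }
}
```

In this model an idle follower never reaches start(), so it receives no DN
heartbeats and has no way to satisfy its safemode rules.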
> Proposed solution:
> 1) DatanodeProtocolServer can be started immediately when SCM starts, so that
> DN register/re-register can happen immediately.
> 2) Revert Ratis log metadata: ensure the termUpdate event also happens for
> metadata changes.
> 3) StateMachineReadyRule is triggered by the StateMachine, which uses
> refreshAndvalidate() on all rules so that all rules are re-checked; safemode
> is exited only after the state machine has flushed / applied all Raft log
> entries on startup, to avoid the in-progress Raft log transaction issue
> referred to in HDDS-5263.
>
>
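
The proposal in the description could be modelled roughly as follows. This is
a sketch under stated assumptions, not the actual patch: SafeModeManager,
FollowerScm and every method on them are hypothetical names invented for this
example. The "state machine ready" rule is modelled as "applied index has
reached the last log index known at startup", which is one possible reading
of the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

class SafeModeExitSketch {
    /** Hypothetical safe mode manager: exits only when every rule passes. */
    static final class SafeModeManager {
        private final List<BooleanSupplier> rules = new ArrayList<>();
        private boolean inSafeMode = true;

        void addRule(BooleanSupplier rule) {
            rules.add(rule);
        }

        /** Re-checks all rules and leaves safe mode once all of them hold. */
        void refreshAndValidate() {
            if (inSafeMode && rules.stream().allMatch(BooleanSupplier::getAsBoolean)) {
                inSafeMode = false;
            }
        }

        boolean isInSafeMode() {
            return inSafeMode;
        }
    }

    /** Hypothetical follower SCM wiring the three proposals together. */
    static final class FollowerScm {
        final SafeModeManager safeMode = new SafeModeManager();
        private final long lastLogIndexAtStartup;
        private boolean rpcServerStarted;
        private boolean datanodesReported;
        private long appliedIndex;

        FollowerScm(long lastIndex) {
            this.lastLogIndexAtStartup = lastIndex;
            // Assumed rule: datanodes have registered and reported.
            safeMode.addRule(() -> datanodesReported);
            // Assumed "state machine ready" rule: startup log fully applied.
            safeMode.addRule(() -> appliedIndex >= lastLogIndexAtStartup);
        }

        // Proposal 1: start the datanode RPC server as part of SCM startup.
        void start() {
            rpcServerStarted = true;
        }

        // Reports only arrive once the RPC server is listening.
        void onDatanodeReport() {
            if (rpcServerStarted) {
                datanodesReported = true;
                safeMode.refreshAndValidate();
            }
        }

        // Proposals 2 and 3: called for every applied entry (metadata entries
        // included, by assumption) and re-validates all rules each time.
        void notifyTermIndexUpdated(long index) {
            appliedIndex = index;
            safeMode.refreshAndValidate();
        }
    }

    public static void main(String[] args) {
        FollowerScm scm = new FollowerScm(5L);
        scm.start();
        scm.onDatanodeReport();
        System.out.println("after DN report: inSafeMode=" + scm.safeMode.isInSafeMode());
        scm.notifyTermIndexUpdated(5L);
        System.out.println("after log applied: inSafeMode=" + scm.safeMode.isInSafeMode());
    }
}
```

In this model the follower leaves safe mode without waiting for a leader
change, but only once the log present at startup has been applied.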