[jira] [Updated] (HDDS-13842) Follower SCM does not comes out of safemode

Sumit Agrawal (Jira) Thu, 06 Nov 2025 04:33:48 -0800


     [ https://issues.apache.org/jira/browse/HDDS-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Sumit Agrawal updated HDDS-13842:
---------------------------------
    Description: 
A follower SCM never comes out of safe mode because the RPC server for DN
heartbeats (HB) is not started at the SCM. It is started only when certain
actions happen, such as a leader change or the leader making some update to
the SCM.

 

scm3: follower on startup

 
{code:java}
2025-10-15 17:39:57,265 INFO [main]-org.apache.hadoop.hdds.scm.node.SCMNodeManager: Entering startup safe mode.
<-- The SCM RPC server is not started yet, so no DN HB is received.
2025-10-15 17:41:10,344 INFO ScmDatanodeProtocol RPC server for DataNodes
<-- The RPC server starts only here, per the log line above (about 73 s after entering safe mode).
2025-10-15 17:42:29,287 INFO [node3-EventQueue-ContainerRegistrationReportForRatisContainerSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM exiting safe mode.
{code}
 

 

Another observation, where no action was performed for about 11 minutes:

 
{code:java}
2025-10-25 19:19:34,473 INFO .. Entering startup safe mode.
<-- About 11 min delay: the test case waits for the follower to exit safe mode and performs no other action.
2025-10-25 19:30:47,921 INFO .. ScmDatanodeProtocol RPC server for DataNodes
{code}
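
The delays quoted in the two log excerpts above can be checked from the timestamps alone. A minimal sketch, assuming only the timestamp layout visible in those log lines; the class and method names (LogGapSketch, gap) are made up for illustration:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Illustrative helper, not part of Ozone: measures the gap between two log timestamps.
class LogGapSketch {

    // Timestamp layout used by the log lines quoted in this issue.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Returns the elapsed time between two log timestamps. */
    static Duration gap(String from, String to) {
        return Duration.between(
            LocalDateTime.parse(from, FORMAT),
            LocalDateTime.parse(to, FORMAT));
    }

    public static void main(String[] args) {
        // "Entering startup safe mode" -> "ScmDatanodeProtocol RPC server for DataNodes"
        Duration first = gap("2025-10-15 17:39:57,265", "2025-10-15 17:41:10,344");
        Duration second = gap("2025-10-25 19:19:34,473", "2025-10-25 19:30:47,921");

        System.out.println("first run:  " + first.toMillis() + " ms");
        System.out.println("second run: " + second.toMinutes() + " min "
            + (second.getSeconds() % 60) + " s");
    }
}
```

In both runs the datanode RPC server comes up long after the SCM has entered safe mode, which matches the annotations in the logs.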
 

 

This happens because the server is started only from the
notifyTermIndexUpdated() call made by Ratis. Below is the call flow.
{code:java}
2025-10-17 14:10:46,937 ERROR [b6e60709-ec61-4360-8fb3-65b2317949c0@group-29860CDEEB45-StateMachineUpdater]-org.apache.hadoop.hdds.scm.server.SCMDatanodeProtocolServer: heyho
java.lang.Exception
        at org.apache.hadoop.hdds.scm.server.SCMDatanodeProtocolServer.start(SCMDatanodeProtocolServer.java:199)
        at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.notifyTermIndexUpdated(SCMStateMachine.java:364)
        at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1848)
        at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:252)
        at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:193)
        at java.lang.Thread.run(Thread.java:748)
{code}
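
The effect of this call flow can be illustrated with a small model. This is only a sketch under stated assumptions: the classes below (DatanodeRpcServer, FollowerStateMachine) are simplified stand-ins invented for illustration and are not the real SCMDatanodeProtocolServer or SCMStateMachine. The only behavior taken from the trace is that start() is reached from notifyTermIndexUpdated().

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified, hypothetical model; these are NOT the real Ozone or Ratis classes.
class LazyStartSketch {

    /** Stand-in for the datanode-facing RPC server. */
    static class DatanodeRpcServer {
        private final AtomicBoolean started = new AtomicBoolean(false);

        /** Idempotent start: returns true only for the call that actually starts it. */
        boolean start() {
            return started.compareAndSet(false, true);
        }

        /** A heartbeat can only be accepted once the server has been started. */
        boolean receiveHeartbeat() {
            return started.get();
        }
    }

    /** Stand-in for a state machine that starts the server lazily. */
    static class FollowerStateMachine {
        private final DatanodeRpcServer server;

        FollowerStateMachine(DatanodeRpcServer server) {
            this.server = server;
        }

        /** In this model, the only place where the server gets started. */
        void notifyTermIndexUpdated(long term, long index) {
            server.start();
        }
    }

    public static void main(String[] args) {
        DatanodeRpcServer server = new DatanodeRpcServer();
        FollowerStateMachine stateMachine = new FollowerStateMachine(server);

        // Idle cluster: no callback has fired yet, so heartbeats are not accepted.
        System.out.println("heartbeat accepted while idle: " + server.receiveHeartbeat());

        // A leader change or an update from the leader triggers the callback.
        stateMachine.notifyTermIndexUpdated(2L, 10L);
        System.out.println("heartbeat accepted after callback: " + server.receiveHeartbeat());
    }
}
```

As long as nothing triggers the callback, the model never accepts a heartbeat, which corresponds to the follower staying in safe mode on an idle cluster.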
 

Proposed solution:

1) Start DatanodeProtocolServer immediately when the SCM starts, so that DN
register/re-register can happen right away.

2) Revert Ratis log metadata: ensure the termUpdate event also happens for
metadata changes.

3) Have the StateMachine trigger StateMachineReadyRule using
refreshAndvalidate() on all rules, so that all rules are re-checked and safe
mode is exited only after the state machine has flushed / applied all Raft log
entries on startup. This avoids the in-progress Raft log transaction issue
referred to by HDDS-5263.
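
The combination of an early server start and a state-machine-ready rule can be sketched as follows. This is a rough model under assumptions, not the Ozone safe mode implementation: SafeModeManager, StateMachine, recheckAllRules() and isReady() are hypothetical names chosen for illustration, and "ready" is assumed here to mean that the applied index has reached the last log index present at startup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical sketch; not the Ozone safe mode implementation.
class SafeModeSketch {

    /** Safe mode is left only when every registered rule is satisfied. */
    static class SafeModeManager {
        private final List<BooleanSupplier> rules = new ArrayList<>();
        private boolean inSafeMode = true;

        void addRule(BooleanSupplier rule) {
            rules.add(rule);
        }

        /** Re-checks all rules and leaves safe mode once all of them hold. */
        void recheckAllRules() {
            if (!inSafeMode) {
                return;
            }
            for (BooleanSupplier rule : rules) {
                if (!rule.getAsBoolean()) {
                    return;
                }
            }
            inSafeMode = false;
        }

        boolean isInSafeMode() {
            return inSafeMode;
        }
    }

    /** Tracks whether the log that existed at startup has been fully applied. */
    static class StateMachine {
        private final long lastIndexAtStartup;
        private final SafeModeManager safeMode;
        private long appliedIndex = -1L;

        StateMachine(long lastIndexAtStartup, SafeModeManager safeMode) {
            this.lastIndexAtStartup = lastIndexAtStartup;
            this.safeMode = safeMode;
        }

        boolean isReady() {
            return appliedIndex >= lastIndexAtStartup;
        }

        /** Called after each applied entry; triggers a re-check once replay is done. */
        void applied(long index) {
            appliedIndex = index;
            if (isReady()) {
                safeMode.recheckAllRules();
            }
        }
    }

    public static void main(String[] args) {
        SafeModeManager safeMode = new SafeModeManager();
        StateMachine stateMachine = new StateMachine(5L, safeMode);
        int[] registeredDatanodes = {0};

        safeMode.addRule(() -> registeredDatanodes[0] >= 1); // datanode rule (assumed threshold)
        safeMode.addRule(stateMachine::isReady);             // state-machine-ready rule

        // With the RPC server started at SCM startup, a DN can register immediately.
        registeredDatanodes[0] = 1;
        safeMode.recheckAllRules();
        System.out.println("in safe mode before replay: " + safeMode.isInSafeMode());

        // Replaying the startup log; the last entry triggers the re-check.
        for (long index = 0; index <= 5; index++) {
            stateMachine.applied(index);
        }
        System.out.println("in safe mode after replay: " + safeMode.isInSafeMode());
    }
}
```

In this model the datanode rule alone is not enough to leave safe mode; the exit happens only once the startup log has been applied, which is the ordering that item 3 asks for.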

 

 

  was:
Follower SCM never comes out of safe mode, as RPC server for DN HB is not 
started at SCM. Its started if certain action like leader change happens.

 

scm3: follower on startup

 
{code:java}
2025-10-15 17:39:57,265 INFO 
[main]-org.apache.hadoop.hdds.scm.node.SCMNodeManager: Entering startup safe 
mode.
<-- SCM RPC server is not started and hence do not receive DN HB.
2025-10-15 17:41:10,344 INFO ScmDatanodeProtocol RPC server for DataNodes
<-- RPC started as above log
2025-10-15 17:42:29,287 INFO 
[node3-EventQueue-ContainerRegistrationReportForRatisContainerSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
 SCM exiting safe mode.
{code}
 

 

Another observation where no action performed for 10 min:

 
{code:java}
2025-10-25 19:19:34,473 INFO .. Entering startup safe mode.
<-- 11 min delay as test case waiting for follower to exit safemode and no 
action
2025-10-25 19:30:47,921 INFO .. ScmDatanodeProtocol RPC server for DataNodes
{code}
 

 

This is as, its started on notifyTermIndexUpdated() call from ratis. below is 
call flow.
{code:java}
2025-10-17 14:10:46,937 ERROR 
[b6e60709-ec61-4360-8fb3-65b2317949c0@group-29860CDEEB45-StateMachineUpdater]-org.apache.hadoop.hdds.scm.server.SCMDatanodeProtocolServer:
 heyho
java.lang.Exception
        at 
org.apache.hadoop.hdds.scm.server.SCMDatanodeProtocolServer.start(SCMDatanodeProtocolServer.java:199)
        at 
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.notifyTermIndexUpdated(SCMStateMachine.java:364)
        at 
org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1848)
        at 
org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:252)
        at 
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:193)
        at java.lang.Thread.run(Thread.java:748) {code}
 

 


> Follower SCM does not comes out of safemode
> -------------------------------------------
>
>                 Key: HDDS-13842
>                 URL: https://issues.apache.org/jira/browse/HDDS-13842
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>    Affects Versions: 2.1.0
>            Reporter: Sumit Agrawal
>            Assignee: Sumit Agrawal
>            Priority: Major
>              Labels: pull-request-available
>
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
