[ 
https://issues.apache.org/jira/browse/HDDS-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Agrawal updated HDDS-13842:
---------------------------------
    Description: 
A follower SCM never comes out of safe mode because the RPC server for DN 
heartbeats (HB) is not started on that SCM. It is started only when certain 
actions, such as a leader change, happen.

 

scm3: follower on startup

 
{code:java}
2025-10-15 17:39:57,265 INFO [main]-org.apache.hadoop.hdds.scm.node.SCMNodeManager: Entering startup safe mode.
<-- SCM RPC server is not started, and hence SCM does not receive DN HB.
2025-10-15 17:41:10,344 INFO ScmDatanodeProtocol RPC server for DataNodes
<-- RPC server started, as shown in the log line above
2025-10-15 17:42:29,287 INFO [node3-EventQueue-ContainerRegistrationReportForRatisContainerSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM exiting safe mode.
{code}
 

 

Another observation, where no action was performed for more than 10 minutes:

 
{code:java}
2025-10-25 19:19:34,473 INFO .. Entering startup safe mode.
<-- 11 min delay: the test case was waiting for the follower to exit safe mode, and no action was performed
2025-10-25 19:30:47,921 INFO .. ScmDatanodeProtocol RPC server for DataNodes
{code}
 

 

This is because the RPC server is started on the notifyTermIndexUpdated() call 
from Ratis. Below is the call flow.
{code:java}
2025-10-17 14:10:46,937 ERROR [b6e60709-ec61-4360-8fb3-65b2317949c0@group-29860CDEEB45-StateMachineUpdater]-org.apache.hadoop.hdds.scm.server.SCMDatanodeProtocolServer: heyho
java.lang.Exception
        at org.apache.hadoop.hdds.scm.server.SCMDatanodeProtocolServer.start(SCMDatanodeProtocolServer.java:199)
        at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.notifyTermIndexUpdated(SCMStateMachine.java:364)
        at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1848)
        at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:252)
        at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:193)
        at java.lang.Thread.run(Thread.java:748)
{code}
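
The dependency shown in the call flow can be modelled with a small sketch. This is only an illustration of the failure mode, not Ozone or Ratis code: the class names SafeModeSketch, DatanodeRpcServer and FollowerStateMachine are hypothetical, and the sketch assumes that the state machine callback is the only place where the RPC server gets started.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical model of the start-up dependency; not Ozone or Ratis code. */
class SafeModeSketch {

  /** Stand-in for the datanode-facing RPC server. */
  static class DatanodeRpcServer {
    private final AtomicBoolean started = new AtomicBoolean(false);

    /** Idempotent start: only the first call has an effect. */
    void start() {
      if (started.compareAndSet(false, true)) {
        System.out.println("RPC server for DataNodes started");
      }
    }

    /** Heartbeats can be received only while the server is running. */
    boolean isStarted() {
      return started.get();
    }
  }

  /** Stand-in for a state machine whose callback is the only start trigger (assumption). */
  static class FollowerStateMachine {
    private final DatanodeRpcServer server;

    FollowerStateMachine(DatanodeRpcServer server) {
      this.server = server;
    }

    /** Called when a new term/index is applied; starts the server as a side effect. */
    void notifyTermIndexUpdated(long term, long index) {
      server.start();
    }
  }

  public static void main(String[] args) {
    DatanodeRpcServer server = new DatanodeRpcServer();
    FollowerStateMachine follower = new FollowerStateMachine(server);

    // Follower has entered safe mode, but the callback has not fired yet.
    System.out.println("can receive DN HB: " + server.isStarted());

    // A later event (for example a leader change) makes the callback fire.
    follower.notifyTermIndexUpdated(2L, 100L);
    System.out.println("can receive DN HB: " + server.isStarted());
  }
}
```

In this model, a follower on which the callback is never invoked keeps the server stopped, so it receives no DN HB and cannot satisfy the safe mode exit rules.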
 

 

  was:
A follower SCM never comes out of safe mode because the RPC server for DN 
heartbeats (HB) is not started on that SCM. It is started only when certain 
actions, such as a leader change, happen.

 

scm3: follower on startup

 
{code:java}
2025-10-15 17:39:57,265 INFO [main]-org.apache.hadoop.hdds.scm.node.SCMNodeManager: Entering startup safe mode.
<-- SCM RPC server is not started, and hence SCM does not receive DN HB.
2025-10-15 17:41:10,344 INFO ScmDatanodeProtocol RPC server for DataNodes
<-- RPC server started, as shown in the log line above
2025-10-15 17:42:29,287 INFO [node3-EventQueue-ContainerRegistrationReportForRatisContainerSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM exiting safe mode.
{code}
 

 

Another observation, where no action was performed for more than 10 minutes:

 
{code:java}
2025-10-25 19:19:34,473 INFO .. Entering startup safe mode.
<-- 11 min delay: the test case was waiting for the follower to exit safe mode, and no action was performed
2025-10-25 19:30:47,921 INFO .. ScmDatanodeProtocol RPC server for DataNodes
{code}
 

 

This is because the RPC server is started on the notifyTermIndexUpdated() call 
from Ratis. Below is the call flow.
{code:java}
2025-10-17 14:10:46,937 ERROR [b6e60709-ec61-4360-8fb3-65b2317949c0@group-29860CDEEB45-StateMachineUpdater]-org.apache.hadoop.hdds.scm.server.SCMDatanodeProtocolServer: heyho
java.lang.Exception
        at org.apache.hadoop.hdds.scm.server.SCMDatanodeProtocolServer.start(SCMDatanodeProtocolServer.java:199)
        at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.notifyTermIndexUpdated(SCMStateMachine.java:364)
        at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1848)
        at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:252)
        at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:193)
        at java.lang.Thread.run(Thread.java:748)
{code}
 

This issue was introduced when SCM metadata write was disabled by the following change:

HDDS-13281. Disable Ratis metadata write to Raft Log on OM & SCM.


> Follower SCM does not come out of safe mode when Ratis metadata write is disabled
> ---------------------------------------------------------------------------------
>
>                 Key: HDDS-13842
>                 URL: https://issues.apache.org/jira/browse/HDDS-13842
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>    Affects Versions: 2.1.0
>            Reporter: Sumit Agrawal
>            Assignee: Sumit Agrawal
>            Priority: Major
>              Labels: pull-request-available
>
> A follower SCM never comes out of safe mode because the RPC server for DN 
> heartbeats (HB) is not started on that SCM. It is started only when certain 
> actions, such as a leader change, happen.
>  
> scm3: follower on startup
>  
> {code:java}
> 2025-10-15 17:39:57,265 INFO [main]-org.apache.hadoop.hdds.scm.node.SCMNodeManager: Entering startup safe mode.
> <-- SCM RPC server is not started, and hence SCM does not receive DN HB.
> 2025-10-15 17:41:10,344 INFO ScmDatanodeProtocol RPC server for DataNodes
> <-- RPC server started, as shown in the log line above
> 2025-10-15 17:42:29,287 INFO [node3-EventQueue-ContainerRegistrationReportForRatisContainerSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM exiting safe mode.
> {code}
>  
>  
> Another observation, where no action was performed for more than 10 minutes:
>  
> {code:java}
> 2025-10-25 19:19:34,473 INFO .. Entering startup safe mode.
> <-- 11 min delay: the test case was waiting for the follower to exit safe mode, and no action was performed
> 2025-10-25 19:30:47,921 INFO .. ScmDatanodeProtocol RPC server for DataNodes
> {code}
>  
>  
> This is because the RPC server is started on the notifyTermIndexUpdated() call 
> from Ratis. Below is the call flow.
> {code:java}
> 2025-10-17 14:10:46,937 ERROR [b6e60709-ec61-4360-8fb3-65b2317949c0@group-29860CDEEB45-StateMachineUpdater]-org.apache.hadoop.hdds.scm.server.SCMDatanodeProtocolServer: heyho
> java.lang.Exception
>         at org.apache.hadoop.hdds.scm.server.SCMDatanodeProtocolServer.start(SCMDatanodeProtocolServer.java:199)
>         at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.notifyTermIndexUpdated(SCMStateMachine.java:364)
>         at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1848)
>         at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:252)
>         at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:193)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
