Viraj Jasani created HBASE-29251:
------------------------------------

             Summary: SCP gets stuck forever if proc state cannot be persisted
                 Key: HBASE-29251
                 URL: https://issues.apache.org/jira/browse/HBASE-29251
             Project: HBase
          Issue Type: Improvement
    Affects Versions: 2.6.2, 2.5.11, 3.0.0-beta-1, 2.4.18
            Reporter: Viraj Jasani


When a regionserver stops or aborts, the active master initiates the 
corresponding ServerCrashProcedure (SCP). We recently came across a case where 
the initial SCP state, SERVER_CRASH_START, could not be persisted in the master 
local region store:
{code:java}
2025-04-09 19:00:23,538 ERROR [RegionServerTracker-0] 
region.RegionProcedureStore - Failed to update proc pid=60020, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server1,60020,1731526432248, splitWal=true, meta=false
java.io.InterruptedIOException: No ack received after 55s and a timeout of 55s
    at 
org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:938)
    at 
org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:692)
    at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:580)
    at 
org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
    at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:85)
    at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:666) 
{code}
 

As a result, the SCP made no further progress; it stayed stuck until the active 
master was restarted manually.

After the manual restart, the new active master was able to proceed with the 
SCP:
{code:java}
2025-04-09 20:43:07,693 DEBUG [master/hmaster-3:60000:becomeActiveMaster] 
procedure2.ProcedureExecutor - Stored pid=60771, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server1,60020,1731526432248, splitWal=true, meta=false

2025-04-09 20:44:15,312 INFO  [PEWorker-18] procedure2.ProcedureExecutor - 
Finished pid=60771, state=SUCCESS; ServerCrashProcedure 
server1,60020,1731526432248, splitWal=true, meta=false in 1 mins, 7.667 sec 
{code}
 

It is well known that, for the active master to operate without functional 
issues, the file system backing the master local region must be healthy. It is 
however worth noting that hdfs can have transient issues, and the master should 
be able to recover procedures like SCP unless the hdfs issues persist for a 
longer duration.

A couple of proposals:
 * Retry proc store persist failures a bounded number of times
 * Abort the active master so that a new master can continue the recovery 
(deployment systems such as k8s or Ambari usually ensure that aborted servers 
are auto-restarted)
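The two proposals compose naturally: retry the persist with backoff first, and only if all attempts fail, fall through to aborting the master. A minimal sketch of that flow is below; the names (persistWithRetries, Persister) and the backoff constants are purely illustrative and do not exist in HBase:

{code:java}
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch: retry a procedure-store persist a bounded
 * number of times before giving up. If all attempts fail, the caller
 * would apply the second proposal and abort the active master so a
 * new master can take over the recovery.
 */
public class ProcStoreRetrySketch {

    interface Persister {
        void persist() throws IOException;
    }

    /** Returns true if the persist eventually succeeded. */
    static boolean persistWithRetries(Persister p, int maxAttempts,
                                      long initialBackoffMs) {
        long backoff = initialBackoffMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                p.persist();
                return true;
            } catch (IOException e) {
                if (attempt == maxAttempts) {
                    // Out of retries: per the second proposal, abort
                    // the active master here instead of staying stuck.
                    return false;
                }
                try {
                    Thread.sleep(backoff);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
                backoff *= 2; // exponential backoff between attempts
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulated transient hdfs hiccup: fails twice, then succeeds.
        boolean ok = persistWithRetries(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new IOException("No ack received");
            }
        }, 5, 10);
        System.out.println("persisted=" + ok + " attempts=" + calls.get());
    }
}
{code}

With bounded retries, a transient sync timeout like the one above would be absorbed, while a persistent hdfs outage would still surface as a master abort rather than a silently stuck SCP.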



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
