[
https://issues.apache.org/jira/browse/HBASE-29251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Viraj Jasani updated HBASE-29251:
---------------------------------
Issue Type: Bug (was: Improvement)
> Procedure gets stuck if the procedure state cannot be persisted
> ---------------------------------------------------------------
>
> Key: HBASE-29251
> URL: https://issues.apache.org/jira/browse/HBASE-29251
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.4.18, 3.0.0-beta-1, 2.5.11, 2.6.2
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Critical
> Labels: pull-request-available
> Fix For: 2.7.0, 3.0.0-beta-2, 2.6.3, 2.5.12
>
>
> When a given regionserver stops or aborts, the corresponding
> ServerCrashProcedure (SCP) is initiated by the active master. We have recently
> come across a case where the initial SCP state, SERVER_CRASH_START, could not
> be persisted in the master local region store:
> {code:java}
> 2025-04-09 19:00:23,538 ERROR [RegionServerTracker-0] region.RegionProcedureStore - Failed to update proc pid=60020, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server1,60020,1731526432248, splitWal=true, meta=false
> java.io.InterruptedIOException: No ack received after 55s and a timeout of 55s
>   at org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:938)
>   at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:692)
>   at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:580)
>   at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
>   at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:85)
>   at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:666)
> {code}
>
> This led to no further action on the SCP; it stayed stuck until the active
> master was restarted manually.
> After the manual restart, the new active master was able to proceed further
> with the SCP:
> {code:java}
> 2025-04-09 20:43:07,693 DEBUG [master/hmaster-3:60000:becomeActiveMaster] procedure2.ProcedureExecutor - Stored pid=60771, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server1,60020,1731526432248, splitWal=true, meta=false
> 2025-04-09 20:44:15,312 INFO [PEWorker-18] procedure2.ProcedureExecutor - Finished pid=60771, state=SUCCESS; ServerCrashProcedure server1,60020,1731526432248, splitWal=true, meta=false in 1 mins, 7.667 sec
> {code}
>
> While it is well known that, for the active master to operate without
> functional issues, the file system backing the master local region should be
> healthy, it is worth noting that HDFS can have transient issues, and the
> master should be able to recover procedures like SCP unless the HDFS issues
> persist for a longer duration.
> A couple of proposals (a rough sketch combining both follows the list):
> * Provide retries for the proc store persist failures
> * Abort the active master so that a new active master can continue the
> recovery (deployment systems such as k8s or Ambari usually ensure that
> aborted servers are auto-restarted)
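>
> A minimal, illustrative sketch of how the two could be combined: retry the
> store update a bounded number of times and, only once the retries are
> exhausted, abort the active master so that a standby master can take over and
> continue the recovery. The helper below is hypothetical (the class and method
> names, retry count and backoff are placeholders); it only assumes the existing
> ProcedureStore#update and Abortable#abort interfaces.
> {code:java}
> import org.apache.hadoop.hbase.Abortable;
> import org.apache.hadoop.hbase.procedure2.Procedure;
> import org.apache.hadoop.hbase.procedure2.store.ProcedureStore;
>
> // Hypothetical helper, not actual HBase code: bounded retries around a
> // procedure store update, aborting the active master once retries run out.
> public final class ProcStorePersistHelper {
>
>   private static final int MAX_ATTEMPTS = 5;      // illustrative retry limit
>   private static final long BACKOFF_MS = 1000L;   // illustrative base backoff
>
>   public static void persistWithRetries(ProcedureStore store, Procedure<?> proc,
>       Abortable master) {
>     for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
>       try {
>         store.update(proc);   // persist the current procedure state
>         return;               // success, nothing more to do
>       } catch (Exception e) {
>         if (attempt == MAX_ATTEMPTS) {
>           // Retries exhausted: abort so that a new active master picks up the
>           // procedure (e.g. SCP) from the last persisted state and continues.
>           master.abort("Failed to persist " + proc + " after " + MAX_ATTEMPTS
>               + " attempts", e);
>           return;
>         }
>         try {
>           Thread.sleep(BACKOFF_MS * attempt);   // simple linear backoff
>         } catch (InterruptedException ie) {
>           Thread.currentThread().interrupt();
>           return;
>         }
>       }
>     }
>   }
>
>   private ProcStorePersistHelper() {
>   }
> }
> {code}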
--
This message was sent by Atlassian Jira
(v8.20.10#820010)