[ https://issues.apache.org/jira/browse/HBASE-29251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944137#comment-17944137 ]

Viraj Jasani commented on HBASE-29251:
--------------------------------------

The exception mentioned in the description comes directly from the hdfs 
DataStreamer, which cannot get an ack from the pipelined datanodes. In a way, 
yes, it can be said to be similar to WALSyncTimeoutIOException, although the 
two have different origins.
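For context, both failures reach callers through the same WAL sync path; here 
is a minimal sketch of where they would be told apart (SyncFailureExample and 
syncAndClassify are illustrative names of mine, not HBase methods):
{code:java}
import java.io.IOException;
import java.io.InterruptedIOException;

import org.apache.hadoop.fs.FSDataOutputStream;

// Sketch only: shows where the two failure modes could be distinguished on
// the WAL sync path.
final class SyncFailureExample {
  static void syncAndClassify(FSDataOutputStream out) throws IOException {
    try {
      // hflush blocks in DataStreamer.waitForAckedSeqno until the pipelined
      // datanodes ack; the "No ack received after 55s" error is raised there
      out.hflush();
    } catch (InterruptedIOException e) {
      // origin: the hdfs client pipeline, unlike WALSyncTimeoutIOException,
      // which is raised by the WAL system's own sync-timeout accounting
      throw e;
    }
  }
}
{code}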
{quote}This is a potential bug in FSHLog implementation in 2.x, where we will 
throw the exception out, in AsyncFSWAL and FSHLog in 3.x, we will not throw 
this exception out but retry forever in the WAL system.
{quote}
I see, are you talking about HBASE-27231?

It seems that aborting the master right away might not be a bad idea: we need 
to make progress in persisting the proc state in some way, and a lower number 
of retries would only help a few times. The goal is to not get stuck here. The 
only catch is that we might sometimes run into abort loops, so recovering hdfs 
health within a short duration is important.
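To illustrate the abort-loop concern, here is a minimal sketch of one possible 
damping guard, purely an assumption on my side (neither the class nor the 
uptime threshold exists in HBase): only let a persist failure abort the master 
once it has been up for a while, so a master that restarts into a 
still-unhealthy hdfs does not immediately die again.
{code:java}
import java.util.concurrent.TimeUnit;

// Hypothetical guard, not part of HBase: dampens abort loops by requiring
// a minimum uptime before a persist failure may trigger a master abort.
final class AbortLoopGuard {
  private final long startTimeMs = System.currentTimeMillis();
  // assumed threshold; a real patch would make this a configuration property
  private final long minUptimeBeforeAbortMs = TimeUnit.MINUTES.toMillis(5);

  boolean mayAbortForPersistFailure() {
    return System.currentTimeMillis() - startTimeMs >= minUptimeBeforeAbortMs;
  }
}
{code}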

Let me give it more thought and create a draft PR in a few days unless the 
consensus changes.

> SCP gets stuck forever if proc state cannot be persisted
> --------------------------------------------------------
>
>                 Key: HBASE-29251
>                 URL: https://issues.apache.org/jira/browse/HBASE-29251
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 2.4.18, 3.0.0-beta-1, 2.5.11, 2.6.2
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>
> When a given regionserver stops or aborts, the corresponding 
> ServerCrashProcedure (SCP) is initiated by the active master. We have recently 
> come across a case where the initial state of the SCP, SERVER_CRASH_START, 
> could not be persisted in the master local region store:
> {code:java}
> 2025-04-09 19:00:23,538 ERROR [RegionServerTracker-0] region.RegionProcedureStore - Failed to update proc pid=60020, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server1,60020,1731526432248, splitWal=true, meta=false
> java.io.InterruptedIOException: No ack received after 55s and a timeout of 55s
>     at org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:938)
>     at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:692)
>     at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:580)
>     at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
>     at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:85)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:666)
> {code}
>  
> This led to no further action on the SCP; it stayed stuck until the active 
> master was restarted manually.
> After the manual restart, the new active master was able to proceed with the 
> SCP:
> {code:java}
> 2025-04-09 20:43:07,693 DEBUG [master/hmaster-3:60000:becomeActiveMaster] procedure2.ProcedureExecutor - Stored pid=60771, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server1,60020,1731526432248, splitWal=true, meta=false
> 2025-04-09 20:44:15,312 INFO  [PEWorker-18] procedure2.ProcedureExecutor - Finished pid=60771, state=SUCCESS; ServerCrashProcedure server1,60020,1731526432248, splitWal=true, meta=false in 1 mins, 7.667 sec
> {code}
>  
> While it is well known that, for the active master to operate without 
> functional issues, the file system backing the master local region must be 
> healthy, it is worth noting that hdfs can have transient issues, and the 
> master should be able to recover procedures like SCP unless the hdfs issues 
> persist for a longer duration.
> A couple of proposals (a rough sketch combining them follows the list):
>  * Provide retries for proc store persist failures
>  * Abort the active master so that a new active master can continue the 
> recovery (deployment systems usually ensure that aborted servers are 
> auto-restarted, e.g. k8s or ambari)
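> A minimal sketch combining the two proposals, bounded retries followed by an 
> abort (PersistOp, MAX_PERSIST_RETRIES and RETRY_BACKOFF_MS are illustrative 
> names, not existing HBase identifiers):
> {code:java}
> import java.io.IOException;
> 
> final class PersistRetryExample {
> 
>   @FunctionalInterface
>   interface PersistOp {
>     void run() throws IOException; // e.g. a wrapper around the proc store update
>   }
> 
>   static final int MAX_PERSIST_RETRIES = 3;   // hypothetical bound, would be configurable
>   static final long RETRY_BACKOFF_MS = 1000L; // hypothetical backoff base
> 
>   static void persistOrAbort(PersistOp persist, Runnable abortMaster)
>       throws InterruptedException {
>     for (int attempt = 1; attempt <= MAX_PERSIST_RETRIES; attempt++) {
>       try {
>         persist.run();
>         return; // state persisted, the SCP can make progress
>       } catch (IOException e) {
>         // e.g. InterruptedIOException("No ack received ...") from the hflush path
>         if (attempt == MAX_PERSIST_RETRIES) {
>           // give up and let a freshly started active master retake the recovery
>           abortMaster.run();
>           return;
>         }
>         Thread.sleep(RETRY_BACKOFF_MS * attempt); // linear backoff between attempts
>       }
>     }
>   }
> }
> {code}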


