[ 
https://issues.apache.org/jira/browse/HBASE-28420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823833#comment-17823833
 ] 

Duo Zhang commented on HBASE-28420:
-----------------------------------

{quote}
The Old active master didn't hang up. It finished the abortion in 30 seconds. 
But before finishing it acknowledged the report for remoteProcedureDone which 
should be acknowledged by the new active master to process further. 
{quote}

As I said above, if a master has already aborted, it cannot accept 
reports from region servers, because it will fail when persisting the state to 
the procedure store. When becoming active, the new active master calls 
recoverLease to close the procedure store's WAL so that the old master cannot 
write to it any more. This is the fencing mechanism here.
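
The fencing described above can be modelled with a toy write-ahead log. This 
is only a sketch and not HBase code: FencedLog, FencedError and the 
recover_lease method are hypothetical names chosen for illustration, standing 
in for the recoverLease call on the procedure store's WAL.

```python
class FencedError(Exception):
    """Raised when a writer that no longer holds the lease tries to append."""


class FencedLog:
    """Toy write-ahead log that accepts appends only from the lease holder."""

    def __init__(self):
        self.entries = []
        self.holder = None

    def recover_lease(self, writer):
        # Taking over the lease shuts out the previous holder.
        self.holder = writer

    def append(self, writer, entry):
        if writer != self.holder:
            raise FencedError(f"{writer} no longer holds the lease")
        self.entries.append(entry)


log = FencedLog()
log.recover_lease("old-master")
log.append("old-master", "pid=1 state=RUNNABLE")

# The new active master takes over; the old master is now fenced.
log.recover_lease("new-master")
try:
    log.append("old-master", "pid=1 state=SUCCESS")
    old_master_fenced = False
except FencedError:
    old_master_fenced = True
```

In this model the old master's second append is rejected, which corresponds 
to the persist failure that prevents an aborted master from accepting a report.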

So if you find that the old active master can still accept reports from 
region servers, then it is not dead yet, and there is no problem with it 
accepting the report. Even if it goes down immediately after persisting the 
state, the new active master will load the persisted state and move the 
procedure forward.
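
The argument that a report persisted just before the old master dies is not 
lost can be illustrated the same way. Again this is a hedged sketch with 
hypothetical names (ProcedureStore, Master, and the state strings 
WAITING_REMOTE, REMOTE_DONE and FINISHED are all invented for the example); 
only the pid is taken from the logs in the issue.

```python
class ProcedureStore:
    """Toy durable store that outlives any single master (hypothetical)."""

    def __init__(self):
        self._states = {}

    def persist(self, pid, state):
        self._states[pid] = state

    def load(self):
        return dict(self._states)


class Master:
    def __init__(self, name, store):
        self.name = name
        self.store = store
        # On startup, load whatever the previous master persisted.
        self.procedures = store.load()

    def remote_procedure_done(self, pid):
        # Persist first; only a durable report counts as accepted.
        self.store.persist(pid, "REMOTE_DONE")
        self.procedures[pid] = "REMOTE_DONE"

    def resume(self):
        """Advance every procedure whose remote step is done; return their pids."""
        advanced = []
        for pid, state in list(self.procedures.items()):
            if state == "REMOTE_DONE":
                self.store.persist(pid, "FINISHED")
                self.procedures[pid] = "FINISHED"
                advanced.append(pid)
        return advanced


store = ProcedureStore()
store.persist(3305548, "WAITING_REMOTE")

old_master = Master("old", store)
old_master.remote_procedure_done(3305548)  # accepted and persisted
del old_master                             # old master goes down right after

new_master = Master("new", store)
advanced = new_master.resume()
```

Because the report reached the store before the old master went away, the 
new master sees it on load and can advance the procedure without the region 
server reporting again.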

So I still do not fully understand what is going on here. Why does the old 
active master not quit once it has aborted? Why does the new active master hang 
when initializing the AM? Because meta is not online? What is the state of the 
SCP for the region server that holds the meta region?

> Aborting Active HMaster is not rejecting remote Procedure Reports
> -----------------------------------------------------------------
>
>                 Key: HBASE-28420
>                 URL: https://issues.apache.org/jira/browse/HBASE-28420
>             Project: HBase
>          Issue Type: Bug
>          Components: master, proc-v2
>    Affects Versions: 2.5.7
>            Reporter: Umesh Kumar Kumawat
>            Assignee: Umesh Kumar Kumawat
>            Priority: Critical
>
> When the active HMaster is in the process of aborting and another HMaster is 
> becoming the active HMaster, any region server that reports the completion 
> of a remote procedure at that moment generally sends the report to the old 
> active HMaster because of the cached value of rssStub -> 
> [code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L2829]
>  ([caller 
> method|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L3941]).
>  On the master side 
> ([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java#L2381]),
>  it does check whether the service is started, but that check returns true 
> while the master is aborting (I did not see where this flag is set to false 
> during the abort).  
> This issue becomes *critical* when a *ServerCrash of the meta-hosting RS and 
> a master failover* happen at the same time, leaving hbase:meta stuck in the 
> offline state.
> Logs for abortion start of HMaster 
> {noformat}
> 2024-02-02 07:33:11,581 ERROR [PEWorker-6] master.HMaster - ***** ABORTING 
> master server4-1xxx,61000,1705169084562:
> FAILED persisting region=52d36581218e00a2668776cfea897132 state=CLOSING 
> *****{noformat}
> {noformat}
> 2024-02-02 07:33:40,999 INFO [master/server4-1xxx:61000] 
> regionserver.HRegionServer - Exiting; 
> stopping=hbase2b-mnds4-1-ia2.ops.sfdc.net,61000,1705169084562; zookeeper 
> connection closed.{noformat}
> It took almost 30 seconds to abort the HMaster.
>  
> Logs of starting SCP for meta carrying host. (This SCP is started by the new 
> active HMaster)
> {noformat}
> 2024-02-02 07:33:32,622 INFO [aster/server3-1xxx61000:becomeActiveMaster] 
> assignment.AssignmentManager - Scheduled
> ServerCrashProcedure pid=3305546 for server5-1xxx61020,1706857451955 
> (carryingMeta=true) server5-1-
> xxx61020,1706857451955/CRASHED/regionCount=1/lock=java.util.concurrent.locks.ReentrantReadWriteLock@1b0a5293[Write
>  
> locks = 1, Read locks = 0], oldState=ONLINE.{noformat}
> Initialization of the remote procedure
> {noformat}
> 2024-02-02 07:33:33,178 INFO [PEWorker-4] procedure2.ProcedureExecutor - 
> Initialized subprocedures=[{pid=3305548, 
> ppid=3305547, state=RUNNABLE; SplitWALRemoteProcedure server5-1-
> xxxxt%2C61020%2C1706857451955.meta.1706858156058.meta, 
> worker=server4-1-xxxx,61020,1705169180881}]{noformat}
> Logs of remote procedure handling on Old Active Hmaster(server4-1xxx,61000) 
> (in the process of abortion)
> {noformat}
> 2024-02-02 07:33:37,990 DEBUG 
> [r.default.FPBQ.Fifo.handler=243,queue=9,port=61000] master.HMaster - Remote 
> procedure 
> done, pid=3305548{noformat}
> This should be handled by the new active HMaster so that it can wake up the 
> suspended procedure on the new active HMaster. Because the new active HMaster 
> was not able to wake it up, the SCP got stuck and meta stayed OFFLINE. 
>  
> Logs of the HMaster trying to become the active HMaster but getting stuck
> {noformat}
> 2024-02-02 07:33:43,159 WARN [aster/server3-1-ia2:61000:becomeActiveMaster] 
> master.HMaster - hbase:meta,,1.1588230740 
> is NOT online; state={1588230740 state=OPEN, ts=1706859212481, 
> server=server5-1-xxx,61020,1706857451955}; 
> ServerCrashProcedures=true. Master startup cannot progress, in 
> holding-pattern until region onlined.{noformat}
> After this, the master stayed stuck until we performed an HMaster failover 
> to get out of this situation. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
