[
https://issues.apache.org/jira/browse/HBASE-28420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823640#comment-17823640
]
Umesh Kumar Kumawat commented on HBASE-28420:
---------------------------------------------
>>Does the new 'active' master actually finish the active master initialization?
The new 'active' master was not able to finish initialization. It got stuck
while starting the AssignmentManager (AM)
([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java#L1069]).
Before that, it had already started the other master services.
>>If so, I think the region server will start to report to the new active master,
>>as the old active master will fail to persist to the procedure store and the
>>procedure report will fail.
Region servers keep reporting to the old master until they get an error.
First, the report handler on the master checks whether the master services are
running ([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java#L2381]),
which was true in this case because I don't see us setting this flag to false
while aborting the master. After that I didn't see any failure in reporting;
maybe nothing is persisted while reporting
([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerRemoteProcedure.java#L130]).
I only see it waking up an event.
>>If not, then the problem is why the old active master hangs there for 1 hour,
>>without letting other active masters take the charge...
The old active master didn't hang. It finished aborting in 30 seconds.
But before finishing, it acknowledged the remoteProcedureDone report, which
the new active master should have acknowledged in order to proceed.
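To make the sequence above concrete, here is a minimal sketch of it as a toy model. It is not HBase code: ToyMaster, ToyRegionServer, the rejectWhileAborting switch and the simulate helper are all hypothetical names introduced for illustration, and the extra "aborting" check is only one possible guard, not necessarily the fix that will be committed. The model assumes a region server keeps its cached master stub until a call fails, and that the master's "services started" flag stays true during an abort.

```java
/** Toy model of the report routing described above. Hypothetical types, not HBase classes. */
final class ReportRoutingSketch {

  /** Thrown by a toy master that refuses a report. */
  static final class MasterUnavailableException extends Exception {
    MasterUnavailableException(String message) {
      super(message);
    }
  }

  /** Toy master holding the two flags discussed in this issue. */
  static final class ToyMaster {
    final String name;
    final boolean rejectWhileAborting; // hypothetical extra guard, off models current behavior
    boolean servicesStarted = true;    // stays true during an abort in this model
    boolean aborting = false;

    ToyMaster(String name, boolean rejectWhileAborting) {
      this.name = name;
      this.rejectWhileAborting = rejectWhileAborting;
    }

    /** Accepts the report unless a guard rejects it. */
    void reportProcedureDone(long pid) throws MasterUnavailableException {
      if (!servicesStarted) {
        throw new MasterUnavailableException(name + ": services not running");
      }
      if (rejectWhileAborting && aborting) {
        throw new MasterUnavailableException(name + ": aborting, rejecting pid=" + pid);
      }
      // Accepted: in this model the report is simply acknowledged.
    }
  }

  /** Toy region server that only refreshes its cached stub after an error. */
  static final class ToyRegionServer {
    private ToyMaster cachedStub;
    private final ToyMaster currentActive; // stands in for looking up the active master

    ToyRegionServer(ToyMaster cachedStub, ToyMaster currentActive) {
      this.cachedStub = cachedStub;
      this.currentActive = currentActive;
    }

    /** Returns the name of the master that acknowledged the report. */
    String reportProcedureDone(long pid) {
      for (int attempt = 0; attempt < 2; attempt++) {
        try {
          cachedStub.reportProcedureDone(pid);
          return cachedStub.name;
        } catch (MasterUnavailableException e) {
          cachedStub = currentActive; // switch masters only after a failure
        }
      }
      return "none";
    }
  }

  /** Runs one report while the old master is aborting. */
  static String simulate(boolean rejectWhileAborting) {
    ToyMaster oldMaster = new ToyMaster("old-master", rejectWhileAborting);
    ToyMaster newMaster = new ToyMaster("new-master", rejectWhileAborting);
    oldMaster.aborting = true; // abort in progress, servicesStarted is still true
    ToyRegionServer regionServer = new ToyRegionServer(oldMaster, newMaster);
    return regionServer.reportProcedureDone(3305548L);
  }

  public static void main(String[] args) {
    System.out.println("guard off -> acknowledged by " + simulate(false));
    System.out.println("guard on  -> acknowledged by " + simulate(true));
  }
}
```

With the guard off, the aborting master acknowledges the report and the new master never sees it. With the guard on, the region server gets an error, refreshes its stub, and the new master receives the report.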
> Aborting Active HMaster is not rejecting remote Procedure Reports
> -----------------------------------------------------------------
>
> Key: HBASE-28420
> URL: https://issues.apache.org/jira/browse/HBASE-28420
> Project: HBase
> Issue Type: Bug
> Components: master, proc-v2
> Affects Versions: 2.5.7
> Reporter: Umesh Kumar Kumawat
> Assignee: Umesh Kumar Kumawat
> Priority: Critical
>
> If the active HMaster is in the process of aborting while another HMaster is
> becoming the active HMaster, and a region server reports the completion of a
> remote procedure at that moment, the report generally goes to the old active
> HMaster because of the cached value of rssStub ->
> [code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L2829]
> ([caller
> method|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L3941]).
> On the master side
> ([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java#L2381]),
> the handler does check whether the service is started, but that check still
> returns true while the master is in the process of aborting.
> This issue becomes *critical* when a *ServerCrash of the meta-hosting RS and a
> master failover* happen at the same time, leaving hbase:meta stuck in the
> offline state.
> Logs for the start of the HMaster abort:
>
> {noformat}
> 2024-02-02 07:33:11,581 ERROR [PEWorker-6] master.HMaster - ***** ABORTING
> master server4-1xxx,61000,1705169084562:
> FAILED persisting region=52d36581218e00a2668776cfea897132 state=CLOSING
> *****{noformat}
> Logs for the start of the SCP for the host carrying meta:
> {noformat}
> 2024-02-02 07:33:32,622 INFO [aster/server3-1xxx61000:becomeActiveMaster]
> assignment.AssignmentManager - Scheduled
> ServerCrashProcedure pid=3305546 for server5-1xxx61020,1706857451955
> (carryingMeta=true) server5-1-
> xxx61020,1706857451955/CRASHED/regionCount=1/lock=java.util.concurrent.locks.ReentrantReadWriteLock@1b0a5293[Write
>
> locks = 1, Read locks = 0], oldState=ONLINE.{noformat}
> Logs for the initialization of the remote procedure:
> {noformat}
> 2024-02-02 07:33:33,178 INFO [PEWorker-4] procedure2.ProcedureExecutor -
> Initialized subprocedures=[{pid=3305548,
> ppid=3305547, state=RUNNABLE; SplitWALRemoteProcedure server5-1-
> xxxxt%2C61020%2C1706857451955.meta.1706858156058.meta,
> worker=server4-1-xxxx,61020,1705169180881}]{noformat}
> Logs of remote procedure handling on the old active HMaster
> (server4-1xxx,61000), which was in the process of aborting:
> {noformat}
> 2024-02-02 07:33:37,990 DEBUG
> [r.default.FPBQ.Fifo.handler=243,queue=9,port=61000] master.HMaster - Remote
> procedure
> done, pid=3305548{noformat}
> Logs of the HMaster trying to become the active HMaster:
>
> {noformat}
> 2024-02-02 07:33:43,159 WARN [aster/server3-1-ia2:61000:becomeActiveMaster]
> master.HMaster - hbase:meta,,1.1588230740
> is NOT online; state={1588230740 state=OPEN, ts=1706859212481,
> server=server5-1-xxx,61020,1706857451955};
> ServerCrashProcedures=true. Master startup cannot progress, in
> holding-pattern until region onlined.{noformat}
> After this, the master was stuck for almost 1 hour. We had to do an HMaster
> failover to get out of this situation.