[ 
https://issues.apache.org/jira/browse/HBASE-28420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Umesh Kumar Kumawat updated HBASE-28420:
----------------------------------------
    Description: 
When the active HMaster is in the process of aborting and another HMaster is 
becoming the active HMaster, a region server that reports the completion of a 
remote procedure at that moment generally sends the report to the old active 
HMaster, because of the cached value of rssStub -> 
[code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L2829]
 ([caller 
method|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L3941]).
 On the master side 
([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java#L2381]),
 there is a check that the service is started, but that check also returns true 
while the master is in the process of aborting.
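
The interaction described above can be sketched in isolation. This is a minimal, hypothetical model and not HBase code: the class names (Master, Reporter, MasterNotRunningException), the two flags, and the retry-with-lookup behavior are all assumptions made for illustration. It shows a master that rejects a report once an abort has begun, and a reporter that drops its cached stub on rejection and retries against whatever master the lookup returns.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these types are HBase classes. */
class AbortAwareReportSketch {

  /** Thrown when a master that is stopped or aborting receives a report. */
  static class MasterNotRunningException extends Exception {
    MasterNotRunningException(String msg) {
      super(msg);
    }
  }

  /** Stand-in for a master that tracks "started" and "aborting" separately. */
  static class Master {
    final String name;
    final AtomicBoolean started = new AtomicBoolean(true);
    final AtomicBoolean aborting = new AtomicBoolean(false);
    int acceptedReports = 0;

    Master(String name) {
      this.name = name;
    }

    void reportProcedureDone(long pid) throws MasterNotRunningException {
      // Checking only `started` would accept the report during an abort.
      // This sketch checks both flags before accepting.
      if (!started.get() || aborting.get()) {
        throw new MasterNotRunningException(name + " is not accepting reports, pid=" + pid);
      }
      acceptedReports++;
    }
  }

  /** Stand-in for the region server side, which caches a master reference. */
  static class Reporter {
    private Master cachedStub;
    private final Supplier<Master> activeMasterLookup;

    Reporter(Master initialStub, Supplier<Master> activeMasterLookup) {
      this.cachedStub = initialStub;
      this.activeMasterLookup = activeMasterLookup;
    }

    /** Sends the report and returns the master that accepted it. */
    Master report(long pid, int maxAttempts) throws MasterNotRunningException {
      if (maxAttempts < 1) {
        throw new IllegalArgumentException("maxAttempts must be at least 1");
      }
      MasterNotRunningException last = null;
      for (int attempt = 0; attempt < maxAttempts; attempt++) {
        try {
          cachedStub.reportProcedureDone(pid);
          return cachedStub;
        } catch (MasterNotRunningException e) {
          last = e;
          // Drop the stale cached stub and look up the current active master.
          cachedStub = activeMasterLookup.get();
        }
      }
      throw last;
    }
  }

  public static void main(String[] args) throws Exception {
    Master oldMaster = new Master("old-master");
    Master newMaster = new Master("new-master");
    oldMaster.aborting.set(true); // abort has begun, service is still "started"

    Reporter reporter = new Reporter(oldMaster, () -> newMaster);
    Master acceptedBy = reporter.report(3305548L, 2);
    System.out.println("accepted by " + acceptedBy.name);
  }
}
```

In this model the old master never records the report, so the completion is delivered to the master that will actually drive the procedure forward. Whether the real fix belongs on the master side, the region server side, or both is left open here.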

This issue becomes *critical* when *a ServerCrash of the RS hosting meta and a 
master failover* happen at the same time: hbase:meta gets stuck in the offline 
state.

Logs for the start of the HMaster abort:
{noformat}
2024-02-02 07:33:11,581 ERROR [PEWorker-6] master.HMaster - ***** ABORTING 
master server4-1xxx,61000,1705169084562:
FAILED persisting region=52d36581218e00a2668776cfea897132 state=CLOSING 
*****{noformat}
Logs of the SCP starting for the host carrying meta:
{noformat}
2024-02-02 07:33:32,622 INFO [aster/server3-1xxx61000:becomeActiveMaster] 
assignment.AssignmentManager - Scheduled ServerCrashProcedure pid=3305546 for 
server5-1xxx61020,1706857451955 (carryingMeta=true) 
server5-1-xxx61020,1706857451955/CRASHED/regionCount=1/lock=java.util.concurrent.locks.ReentrantReadWriteLock@1b0a5293[Write 
locks = 1, Read locks = 0], oldState=ONLINE.{noformat}
Initialization of the remote procedure:
{noformat}
2024-02-02 07:33:33,178 INFO [PEWorker-4] procedure2.ProcedureExecutor - 
Initialized subprocedures=[{pid=3305548, ppid=3305547, state=RUNNABLE; 
SplitWALRemoteProcedure 
server5-1-xxxxt%2C61020%2C1706857451955.meta.1706858156058.meta, 
worker=server4-1-xxxx,61020,1705169180881}]{noformat}
Logs of the remote procedure being handled on the old active HMaster 
(server4-1xxx,61000), which is in the process of aborting:
{noformat}
2024-02-02 07:33:37,990 DEBUG 
[r.default.FPBQ.Fifo.handler=243,queue=9,port=61000] master.HMaster - Remote 
procedure done, pid=3305548{noformat}
Logs of the HMaster trying to become the active HMaster:
{noformat}
2024-02-02 07:33:43,159 WARN [aster/server3-1-ia2:61000:becomeActiveMaster] 
master.HMaster - hbase:meta,,1.1588230740 
is NOT online; state={1588230740 state=OPEN, ts=1706859212481, 
server=server5-1-xxx,61020,1706857451955}; 
ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern 
until region onlined.{noformat}
After this, the master was stuck for almost 1 hour. We had to perform an HMaster 
failover to recover from this situation. 
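
To make the ordering of the log excerpts above concrete, the following sketch computes how long after the abort began each later event was logged. The timestamps are copied from the excerpts; the class and method names are hypothetical, and the formatter pattern is an assumption about the log layout shown here.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper for lining up the log excerpts; not part of HBase. */
class AbortTimelineSketch {
  // Matches timestamps such as "2024-02-02 07:33:11,581" from the excerpts.
  private static final DateTimeFormatter LOG_TS =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

  // Timestamp of the "ABORTING master" log line.
  static final String ABORT_START = "2024-02-02 07:33:11,581";

  /** Milliseconds between the abort log line and the given log timestamp. */
  static long millisSinceAbort(String timestamp) {
    LocalDateTime start = LocalDateTime.parse(ABORT_START, LOG_TS);
    LocalDateTime event = LocalDateTime.parse(timestamp, LOG_TS);
    return Duration.between(start, event).toMillis();
  }

  public static void main(String[] args) {
    System.out.println("SCP scheduled:           +"
        + millisSinceAbort("2024-02-02 07:33:32,622") + " ms");
    System.out.println("Subprocedure created:    +"
        + millisSinceAbort("2024-02-02 07:33:33,178") + " ms");
    System.out.println("Report on old master:    +"
        + millisSinceAbort("2024-02-02 07:33:37,990") + " ms");
    System.out.println("Meta not online warning: +"
        + millisSinceAbort("2024-02-02 07:33:43,159") + " ms");
  }
}
```

By these timestamps, the old master logged "Remote procedure done" roughly 26 seconds after it logged the start of its abort.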



> Aborting Active HMaster is not rejecting remote Procedure Reports
> -----------------------------------------------------------------
>
>                 Key: HBASE-28420
>                 URL: https://issues.apache.org/jira/browse/HBASE-28420
>             Project: HBase
>          Issue Type: Bug
>          Components: master, proc-v2
>    Affects Versions: 2.5.7
>            Reporter: Umesh Kumar Kumawat
>            Assignee: Umesh Kumar Kumawat
>            Priority: Critical



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
