[ 
https://issues.apache.org/jira/browse/HBASE-21614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HBASE-21614:
-------------------------------------
    Description: 
Master is restarting after a previous master crashed while recovering some 
regions from a dead server.

Master recovers RIT for the region, however the RIT has no location (logged, at 
least) in CONFIRM_CLOSE state. That is a potential problem #0.5 - confirm 
where? But that should be covered by meta, so not a big deal, right. As such it 
doesn't seem to add the region to server map anywhere
{noformat}
2018-12-17 14:51:14,606 INFO  [master/:17000:becomeActiveMaster] 
assignment.AssignmentManager: Attach pid=38015, 
state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_CLOSED, hasLock=false; 
TransitRegionStateProcedure table=t1, region=region1, REOPEN/MOVE to 
rit=OFFLINE, location=null, table=t1, region=region1 to restore RIT
{noformat}

However, in this case ServerCrashProcedure for the server kicks off BEFORE meta 
is loaded.
That seems to be a problem #1 - it immediately gets regions to later recover, 
so in this case it gets nothing.
I've grepped our logs for successful cases of SCP interacting with region 
transition at startup, and in all cases the meta was loaded before SCP.
Seems like a race condition.
{noformat}
2018-12-17 14:51:14,625 INFO  [master/:17000:becomeActiveMaster] 
master.RegionServerTracker: Starting RegionServerTracker; 0 have existing 
ServerCrashProcedures, 103 possibly 'live' servers, and 1 'splitting'.
2018-12-17 14:51:20,770 INFO  [master/:17000:becomeActiveMaster] 
master.ServerManager: Processing expiration of server1,17020,1544636616174 on 
,17000,1545087053243
2018-12-17 14:51:20,921 INFO  [master/:17000:becomeActiveMaster] 
assignment.AssignmentManager: Added server1,17020,1544636616174 to dead servers 
which carryingMeta=false, submitted ServerCrashProcedure pid=111298
2018-12-17 14:51:30,728 INFO  [PEWorker-13] procedure.ServerCrashProcedure: 
Start pid=111298, state=RUNNABLE:SERVER_CRASH_START, hasLock=true; 
ServerCrashProcedure server=server1,17020,1544636616174, splitWal=true, 
meta=false
{noformat}
Meta is only loaded 11-12 seconds later.
If one looks at meta-loading code however, there is one more problem #2 - the 
region is in CLOSING state, so the {{addRegionToServer}} is not going to be 
called - it's only called for OPENED regions. 
Expanding on the above, I've only seen SCP unblock stuck region transition at 
startup when region started out in meta as OPEN.
{noformat}
2018-12-17 14:51:42,403 INFO  [master/:17000:becomeActiveMaster] 
assignment.RegionStateStore: Load hbase:meta entry region=region1, 
regionState=CLOSING, lastHost=server1,17020,1544636616174, 
regionLocation=server1,17020,1544636616174, openSeqNum=629131
{noformat}
SCP predictably finishes without doing anything; no other logs for this pid
{noformat}
2018-12-17 14:52:19,046 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
Finished pid=111298, state=SUCCESS, hasLock=false; ServerCrashProcedure 
server=server1,17020,1544636616174, splitWal=true, meta=false in 58.0010sec
{noformat}
After that, region is still stuck trying to be closed in 
TransitRegionStateProcedure; it's in the same state for hours including across 
master restarts.
{noformat}
2018-12-17 15:09:35,216 WARN  [PEWorker-14] 
assignment.TransitRegionStateProcedure: Failed transition, suspend 604secs 
pid=38015, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE, hasLock=true; 
TransitRegionStateProcedure table=t1, region=region1, REOPEN/MOVE; rit=CLOSING, 
location=server1,17020,1544636616174; waiting on rectified condition fixed by 
other Procedure or operator intervention
{noformat}



  was:
Master is restarting after a previous master crashed while recovering some 
regions from a dead server.

Master recovers RIT for the region, however the RIT has no location (logged, at 
least) in CONFIRM_CLOSE state. That is a potential problem #0.5 - confirm 
where? But that should be covered by meta, so not a big deal, right.
{noformat}
2018-12-17 14:51:14,606 INFO  [master/:17000:becomeActiveMaster] 
assignment.AssignmentManager: Attach pid=38015, 
state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_CLOSED, hasLock=false; 
TransitRegionStateProcedure table=t1, region=region1, REOPEN/MOVE to 
rit=OFFLINE, location=null, table=t1, region=region1 to restore RIT
{noformat}

However, in this case ServerCrashProcedure for the server kicks off BEFORE meta 
is loaded.
That seems to be a problem #1 - it immediately gets regions to later recover, 
so in this case it gets nothing.
I've grepped our logs for successful cases of SCP interacting with region 
transition at startup, and in all cases the meta was loaded before SCP.
Seems like a race condition.
{noformat}
2018-12-17 14:51:14,625 INFO  [master/:17000:becomeActiveMaster] 
master.RegionServerTracker: Starting RegionServerTracker; 0 have existing 
ServerCrashProcedures, 103 possibly 'live' servers, and 1 'splitting'.
2018-12-17 14:51:20,770 INFO  [master/:17000:becomeActiveMaster] 
master.ServerManager: Processing expiration of server1,17020,1544636616174 on 
,17000,1545087053243
2018-12-17 14:51:20,921 INFO  [master/:17000:becomeActiveMaster] 
assignment.AssignmentManager: Added server1,17020,1544636616174 to dead servers 
which carryingMeta=false, submitted ServerCrashProcedure pid=111298
2018-12-17 14:51:30,728 INFO  [PEWorker-13] procedure.ServerCrashProcedure: 
Start pid=111298, state=RUNNABLE:SERVER_CRASH_START, hasLock=true; 
ServerCrashProcedure server=server1,17020,1544636616174, splitWal=true, 
meta=false
{noformat}
Meta is only loaded 11-12 seconds later.
If one looks at meta-loading code however, there is one more problem #2 - the 
region is in CLOSING state, so the {{addRegionToServer}} is not going to be 
called - it's only called for OPENED regions. 
Expanding on the above, I've only seen SCP unblock stuck region transition at 
startup when region started out in meta as OPEN.
{noformat}
2018-12-17 14:51:42,403 INFO  [master/:17000:becomeActiveMaster] 
assignment.RegionStateStore: Load hbase:meta entry region=region1, 
regionState=CLOSING, lastHost=server1,17020,1544636616174, 
regionLocation=server1,17020,1544636616174, openSeqNum=629131
{noformat}
SCP predictably finishes without doing anything; no other logs for this pid
{noformat}
2018-12-17 14:52:19,046 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
Finished pid=111298, state=SUCCESS, hasLock=false; ServerCrashProcedure 
server=server1,17020,1544636616174, splitWal=true, meta=false in 58.0010sec
{noformat}
After that, region is still stuck trying to be closed in 
TransitRegionStateProcedure; it's in the same state for hours including across 
master restarts.
{noformat}
2018-12-17 15:09:35,216 WARN  [PEWorker-14] 
assignment.TransitRegionStateProcedure: Failed transition, suspend 604secs 
pid=38015, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE, hasLock=true; 
TransitRegionStateProcedure table=t1, region=region1, REOPEN/MOVE; rit=CLOSING, 
location=server1,17020,1544636616174; waiting on rectified condition fixed by 
other Procedure or operator intervention
{noformat}




> RIT recovery with ServerCrashProcedure is broken in multiple ways
> -----------------------------------------------------------------
>
>                 Key: HBASE-21614
>                 URL: https://issues.apache.org/jira/browse/HBASE-21614
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> Master is restarting after a previous master crashed while recovering some 
> regions from a dead server.
> Master recovers RIT for the region, however the RIT has no location (logged, 
> at least) in CONFIRM_CLOSE state. That is a potential problem #0.5 - confirm 
> where? But that should be covered by meta, so not a big deal, right. As such 
> it doesn't seem to add the region to server map anywhere
> {noformat}
> 2018-12-17 14:51:14,606 INFO  [master/:17000:becomeActiveMaster] 
> assignment.AssignmentManager: Attach pid=38015, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_CLOSED, hasLock=false; 
> TransitRegionStateProcedure table=t1, region=region1, REOPEN/MOVE to 
> rit=OFFLINE, location=null, table=t1, region=region1 to restore RIT
> {noformat}
> However, in this case ServerCrashProcedure for the server kicks off BEFORE 
> meta is loaded.
> That seems to be a problem #1 - it immediately gets regions to later recover, 
> so in this case it gets nothing.
> I've grepped our logs for successful cases of SCP interacting with region 
> transition at startup, and in all cases the meta was loaded before SCP.
> Seems like a race condition.
> {noformat}
> 2018-12-17 14:51:14,625 INFO  [master/:17000:becomeActiveMaster] 
> master.RegionServerTracker: Starting RegionServerTracker; 0 have existing 
> ServerCrashProcedures, 103 possibly 'live' servers, and 1 'splitting'.
> 2018-12-17 14:51:20,770 INFO  [master/:17000:becomeActiveMaster] 
> master.ServerManager: Processing expiration of server1,17020,1544636616174 on 
> ,17000,1545087053243
> 2018-12-17 14:51:20,921 INFO  [master/:17000:becomeActiveMaster] 
> assignment.AssignmentManager: Added server1,17020,1544636616174 to dead 
> servers which carryingMeta=false, submitted ServerCrashProcedure pid=111298
> 2018-12-17 14:51:30,728 INFO  [PEWorker-13] procedure.ServerCrashProcedure: 
> Start pid=111298, state=RUNNABLE:SERVER_CRASH_START, hasLock=true; 
> ServerCrashProcedure server=server1,17020,1544636616174, splitWal=true, 
> meta=false
> {noformat}
> Meta is only loaded 11-12 seconds later.
> If one looks at meta-loading code however, there is one more problem #2 - the 
> region is in CLOSING state, so the {{addRegionToServer}} is not going to be 
> called - it's only called for OPENED regions. 
> Expanding on the above, I've only seen SCP unblock stuck region transition at 
> startup when region started out in meta as OPEN.
> {noformat}
> 2018-12-17 14:51:42,403 INFO  [master/:17000:becomeActiveMaster] 
> assignment.RegionStateStore: Load hbase:meta entry region=region1, 
> regionState=CLOSING, lastHost=server1,17020,1544636616174, 
> regionLocation=server1,17020,1544636616174, openSeqNum=629131
> {noformat}
> SCP predictably finishes without doing anything; no other logs for this pid
> {noformat}
> 2018-12-17 14:52:19,046 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished pid=111298, state=SUCCESS, hasLock=false; ServerCrashProcedure 
> server=server1,17020,1544636616174, splitWal=true, meta=false in 58.0010sec
> {noformat}
> After that, region is still stuck trying to be closed in 
> TransitRegionStateProcedure; it's in the same state for hours including 
> across master restarts.
> {noformat}
> 2018-12-17 15:09:35,216 WARN  [PEWorker-14] 
> assignment.TransitRegionStateProcedure: Failed transition, suspend 604secs 
> pid=38015, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE, hasLock=true; 
> TransitRegionStateProcedure table=t1, region=region1, REOPEN/MOVE; 
> rit=CLOSING, location=server1,17020,1544636616174; waiting on rectified 
> condition fixed by other Procedure or operator intervention
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to