[jira] [Commented] (HBASE-20634) Reopen region while server crash can cause the procedure to be stuck

Duo Zhang (JIRA) Sun, 27 May 2018 05:52:13 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-20634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492013#comment-16492013
 ]


Duo Zhang commented on HBASE-20634:
-----------------------------------

I used a dummy procedure to hold the lock of the region and then release it to 
make the failure happen.

This is the dummy procedure.
{noformat}
2018-05-27 20:33:18,087 INFO  [PEWorker-3] 
procedure.MasterProcedureScheduler(640): pid=9, state=RUNNABLE:STATE; 
DummyRegionProcedure table=test, region=e2c647c382ead17faf1e4ff1d42800f5 
checking lock on e2c647c382ead17faf1e4ff1d42800f5
{noformat}

Then comes the move region procedure. You can see that it is waiting on the 
region lock.
{noformat}
2018-05-27 20:33:18,367 INFO  [PEWorker-4] 
procedure.MasterProcedureScheduler(640): pid=10, 
state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
hri=e2c647c382ead17faf1e4ff1d42800f5, source=ubuntu,39829,1527424391444, 
destination=ubuntu,32879,1527424391176 checking lock on 
e2c647c382ead17faf1e4ff1d42800f5
2018-05-27 20:33:18,368 DEBUG [PEWorker-4] procedure2.ProcedureExecutor(1249): 
LOCK_EVENT_WAIT pid=10, state=RUNNABLE:MOVE_REGION_UNASSIGN; 
MoveRegionProcedure hri=e2c647c382ead17faf1e4ff1d42800f5, 
source=ubuntu,39829,1527424391444, destination=ubuntu,32879,1527424391176
{noformat}

Then the server crash procedure starts:
{noformat}
2018-05-27 20:33:18,605 INFO  [PEWorker-5] procedure.ServerCrashProcedure(119): 
Start pid=11, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=ubuntu,39829,1527424391444, splitWal=true, meta=false
{noformat}

Finally the server crash procedure schedules the assign procedure, which is 
also waiting on the region lock.
{noformat}
2018-05-27 20:33:19,102 INFO  [PEWorker-5] procedure2.ProcedureExecutor(1515): 
Initialized subprocedures=[{pid=12, ppid=11, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test, 
region=e2c647c382ead17faf1e4ff1d42800f5}]
2018-05-27 20:33:19,206 INFO  [PEWorker-5] 
procedure.MasterProcedureScheduler(640): pid=12, ppid=11, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test, 
region=e2c647c382ead17faf1e4ff1d42800f5 checking lock on 
e2c647c382ead17faf1e4ff1d42800f5
2018-05-27 20:33:19,207 DEBUG [PEWorker-5] 
assignment.RegionTransitionProcedure(441): LOCK_EVENT_WAIT pid=12 
serverLocks={}, namespaceLocks={{default=exclusiveLockOwner=NONE, 
sharedLockCount=1, waitingProcCount=0}}, 
tableLocks={{test=exclusiveLockOwner=NONE, sharedLockCount=1, 
waitingProcCount=0}}, 
regionLocks={{e2c647c382ead17faf1e4ff1d42800f5=exclusiveLockOwner=9, 
sharedLockCount=0, waitingProcCount=2}}, peerLocks={}
{noformat}

In the end I finished the dummy procedure, and then we are stuck:
{noformat}
2018-05-27 20:33:19,232 INFO  [PEWorker-3] procedure2.ProcedureExecutor(1265): 
Finished pid=9, state=SUCCESS; DummyRegionProcedure table=test, 
region=e2c647c382ead17faf1e4ff1d42800f5 in 1.2430sec
2018-05-27 20:33:19,235 INFO  [PEWorker-10] 
procedure.MasterProcedureScheduler(640): pid=10, 
state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
hri=e2c647c382ead17faf1e4ff1d42800f5, source=ubuntu,39829,1527424391444, 
destination=ubuntu,32879,1527424391176 checking lock on 
e2c647c382ead17faf1e4ff1d42800f5
2018-05-27 20:33:19,239 INFO  [PEWorker-10] procedure2.ProcedureExecutor(1515): 
Initialized subprocedures=[{pid=13, ppid=10, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=test, 
region=e2c647c382ead17faf1e4ff1d42800f5, server=ubuntu,39829,1527424391444}]
2018-05-27 20:33:19,256 INFO  [PEWorker-10] 
procedure.MasterProcedureScheduler(640): pid=13, ppid=10, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=test, 
region=e2c647c382ead17faf1e4ff1d42800f5, server=ubuntu,39829,1527424391444 
checking lock on e2c647c382ead17faf1e4ff1d42800f5
2018-05-27 20:33:19,256 INFO  [PEWorker-10] assignment.RegionStateStore(197): 
pid=13 updating hbase:meta row=e2c647c382ead17faf1e4ff1d42800f5, 
regionState=CLOSING, regionLocation=ubuntu,39829,1527424391444
2018-05-27 20:33:19,260 INFO  [PEWorker-10] 
assignment.RegionTransitionProcedure(251): Dispatch pid=13, ppid=10, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=test, 
region=e2c647c382ead17faf1e4ff1d42800f5, server=ubuntu,39829,1527424391444; 
rit=CLOSING, location=ubuntu,39829,1527424391444
2018-05-27 20:33:19,260 WARN  [PEWorker-10] 
assignment.RegionTransitionProcedure(225): Remote call failed pid=13, ppid=10, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=test, 
region=e2c647c382ead17faf1e4ff1d42800f5, server=ubuntu,39829,1527424391444; 
rit=CLOSING, location=ubuntu,39829,1527424391444; exception=pid=13, ppid=10, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=test, 
region=e2c647c382ead17faf1e4ff1d42800f5, server=ubuntu,39829,1527424391444 to 
ubuntu,39829,1527424391444
2018-05-27 20:33:19,260 WARN  [PEWorker-10] assignment.UnassignProcedure(281): 
Expiring server pid=13, ppid=10, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
UnassignProcedure table=test, region=e2c647c382ead17faf1e4ff1d42800f5, 
server=ubuntu,39829,1527424391444; rit=CLOSING, 
location=ubuntu,39829,1527424391444, 
exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException:
 pid=13, ppid=10, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
table=test, region=e2c647c382ead17faf1e4ff1d42800f5, 
server=ubuntu,39829,1527424391444 to ubuntu,39829,1527424391444
2018-05-27 20:33:19,260 WARN  [PEWorker-10] master.ServerManager(580): 
Expiration of ubuntu,39829,1527424391444 but server shutdown already in progress
{noformat}
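
The ordering in the logs above can be modelled with a small Java sketch. It is 
only an illustration: RegionLockSketch and its methods are hypothetical names, 
not HBase classes, and it assumes waiters are handed the lock in arrival order, 
which matches what the log shows (pid=10 runs before pid=12 once pid=9 
finishes).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of an exclusive region lock with a wait queue.
// This is NOT the MasterProcedureScheduler implementation.
class RegionLockSketch {
    private long owner = -1L; // -1 means nobody holds the exclusive lock
    private final Deque<Long> waiting = new ArrayDeque<>();

    /** Returns true if pid now owns the lock, false if it was queued. */
    boolean tryLock(long pid) {
        if (owner == -1L) {
            owner = pid;
            return true;
        }
        waiting.addLast(pid);
        return false;
    }

    /** Releases the lock and hands it to the longest waiter; returns the new owner or -1. */
    long release(long pid) {
        if (owner != pid) {
            throw new IllegalStateException("pid=" + pid + " does not hold the lock");
        }
        Long next = waiting.pollFirst();
        owner = (next == null) ? -1L : next;
        return owner;
    }

    int waitingCount() {
        return waiting.size();
    }

    public static void main(String[] args) {
        RegionLockSketch lock = new RegionLockSketch();
        lock.tryLock(9);   // dummy procedure takes the lock
        lock.tryLock(10);  // move region procedure queues up
        lock.tryLock(12);  // assign procedure from the server crash queues up
        System.out.println("waiting=" + lock.waitingCount());
        System.out.println("next owner=" + lock.release(9));
    }
}
```

In this model the move region procedure gets the lock first, so its unassign 
child is dispatched to the crashed server before the assign procedure can run.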

> Reopen region while server crash can cause the procedure to be stuck
> --------------------------------------------------------------------
>
>                 Key: HBASE-20634
>                 URL: https://issues.apache.org/jira/browse/HBASE-20634
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Priority: Critical
>             Fix For: 3.0.0, 2.1.0, 2.0.1
>
>         Attachments: HBASE-20634-UT.patch
>
>
> Found this when implementing HBASE-20424, where we will transit the peer sync 
> replication state while there is a server crash.
> The problem is that, in ServerCrashAssign, we do not hold the region lock. So 
> after we call handleRIT to clear the existing assign/unassign procedures 
> related to this rs, and before we schedule the assign procedures, it is 
> possible that we schedule an unassign procedure for a region on the crashed 
> rs. This procedure will not receive the ServerCrashException. Instead, in 
> addToRemoteDispatcher, it will find that it cannot dispatch the remote call, 
> and a FailedRemoteDispatchException will be raised. But we do not treat this 
> exception the same as ServerCrashException; instead, we try to expire the rs. 
> The rs has already been marked as expired, so this is almost a no-op, and the 
> procedure is then stuck forever.
> A possible way to fix it is to treat FailedRemoteDispatchException the same 
> as ServerCrashException, since it is created only in addToRemoteDispatcher, 
> and the only reason we cannot dispatch a remote call is that the rs is 
> already dead. The nodeMap is a ConcurrentMap, so I think we could use it as a 
> guard.
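
A rough Java sketch of the handling proposed in the description. Everything 
here is a hypothetical stand-in (DispatchSketch, Outcome, and the method names 
are not HBase code); only the idea comes from the text above: a failed dispatch 
means the server is no longer in the node map, so it could be handled like a 
server crash instead of asking for another expiration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical model for illustration only; none of these types are HBase classes.
class DispatchSketch {
    enum Outcome { DISPATCHED, EXPIRE_AND_WAIT, HANDLED_AS_SERVER_CRASH }

    // Stands in for the dispatcher's nodeMap: one entry per server that can
    // still receive remote calls.
    private final ConcurrentMap<String, Boolean> nodeMap = new ConcurrentHashMap<>();
    // Assumed switch between the behaviour seen in the log and the proposed one.
    private final boolean treatFailedDispatchAsCrash;

    DispatchSketch(boolean treatFailedDispatchAsCrash) {
        this.treatFailedDispatchAsCrash = treatFailedDispatchAsCrash;
    }

    void serverOnline(String server) {
        nodeMap.put(server, Boolean.TRUE);
    }

    void serverCrashed(String server) {
        nodeMap.remove(server);
    }

    Outcome dispatchUnassign(String server) {
        if (nodeMap.containsKey(server)) {
            return Outcome.DISPATCHED;
        }
        // The dispatch failed because the server has no entry in the node map.
        if (treatFailedDispatchAsCrash) {
            // Proposed: handle it like a server crash so the procedure can move on.
            return Outcome.HANDLED_AS_SERVER_CRASH;
        }
        // Behaviour in the log: try to expire an already expired server, then wait.
        return Outcome.EXPIRE_AND_WAIT;
    }

    public static void main(String[] args) {
        String rs = "ubuntu,39829,1527424391444";
        for (boolean fix : new boolean[] {false, true}) {
            DispatchSketch d = new DispatchSketch(fix);
            d.serverOnline(rs);
            d.serverCrashed(rs);
            System.out.println("fix=" + fix + " -> " + d.dispatchUnassign(rs));
        }
    }
}
```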



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
