[ https://issues.apache.org/jira/browse/HBASE-20634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492013#comment-16492013 ]
Duo Zhang commented on HBASE-20634:
-----------------------------------
I used a dummy procedure to hold the lock of the region and then release it to
make the failure happen.
This is the dummy procedure.
{noformat}
2018-05-27 20:33:18,087 INFO [PEWorker-3]
procedure.MasterProcedureScheduler(640): pid=9, state=RUNNABLE:STATE;
DummyRegionProcedure table=test, region=e2c647c382ead17faf1e4ff1d42800f5
checking lock on e2c647c382ead17faf1e4ff1d42800f5
{noformat}
Then comes the move region procedure. You can see that it is waiting on the
region lock.
{noformat}
2018-05-27 20:33:18,367 INFO [PEWorker-4]
procedure.MasterProcedureScheduler(640): pid=10,
state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure
hri=e2c647c382ead17faf1e4ff1d42800f5, source=ubuntu,39829,1527424391444,
destination=ubuntu,32879,1527424391176 checking lock on
e2c647c382ead17faf1e4ff1d42800f5
2018-05-27 20:33:18,368 DEBUG [PEWorker-4] procedure2.ProcedureExecutor(1249):
LOCK_EVENT_WAIT pid=10, state=RUNNABLE:MOVE_REGION_UNASSIGN;
MoveRegionProcedure hri=e2c647c382ead17faf1e4ff1d42800f5,
source=ubuntu,39829,1527424391444, destination=ubuntu,32879,1527424391176
{noformat}
Then the server crash procedure starts.
{noformat}
2018-05-27 20:33:18,605 INFO [PEWorker-5] procedure.ServerCrashProcedure(119):
Start pid=11, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
server=ubuntu,39829,1527424391444, splitWal=true, meta=false
{noformat}
Finally it schedules the assign procedure, which is also waiting on the region lock.
{noformat}
2018-05-27 20:33:19,102 INFO [PEWorker-5] procedure2.ProcedureExecutor(1515):
Initialized subprocedures=[{pid=12, ppid=11,
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test,
region=e2c647c382ead17faf1e4ff1d42800f5}]
2018-05-27 20:33:19,206 INFO [PEWorker-5]
procedure.MasterProcedureScheduler(640): pid=12, ppid=11,
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=test,
region=e2c647c382ead17faf1e4ff1d42800f5 checking lock on
e2c647c382ead17faf1e4ff1d42800f5
2018-05-27 20:33:19,207 DEBUG [PEWorker-5]
assignment.RegionTransitionProcedure(441): LOCK_EVENT_WAIT pid=12
serverLocks={}, namespaceLocks={{default=exclusiveLockOwner=NONE,
sharedLockCount=1, waitingProcCount=0}},
tableLocks={{test=exclusiveLockOwner=NONE, sharedLockCount=1,
waitingProcCount=0}},
regionLocks={{e2c647c382ead17faf1e4ff1d42800f5=exclusiveLockOwner=9,
sharedLockCount=0, waitingProcCount=2}}, peerLocks={}
{noformat}
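To make the lock state in the last log line easier to read, here is a minimal sketch of an exclusive region lock with a wait queue. The class RegionLockSketch and its FIFO wake-up order are assumptions for illustration only; this is not the actual MasterProcedureScheduler code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of a single region's exclusive lock (not HBase code).
final class RegionLockSketch {
    private long exclusiveLockOwner = -1;               // -1 stands for NONE
    private final Deque<Long> waiting = new ArrayDeque<>();

    /** Returns true if pid now owns the lock; otherwise pid is queued to wait. */
    boolean tryAcquire(long pid) {
        if (exclusiveLockOwner == -1) {
            exclusiveLockOwner = pid;
            return true;
        }
        waiting.addLast(pid);
        return false;
    }

    /** Releases the lock and hands it to the first waiter; returns that pid or -1. */
    long release(long pid) {
        if (exclusiveLockOwner != pid) {
            throw new IllegalStateException("pid=" + pid + " does not hold the lock");
        }
        Long next = waiting.pollFirst();                // assumed FIFO order
        exclusiveLockOwner = (next == null) ? -1 : next;
        return exclusiveLockOwner;
    }

    long owner() { return exclusiveLockOwner; }

    int waitingProcCount() { return waiting.size(); }

    public static void main(String[] args) {
        RegionLockSketch lock = new RegionLockSketch();
        lock.tryAcquire(9);    // DummyRegionProcedure takes the lock
        lock.tryAcquire(10);   // MoveRegionProcedure has to wait
        lock.tryAcquire(12);   // AssignProcedure has to wait
        System.out.println("owner=" + lock.owner()
            + " waiting=" + lock.waitingProcCount());   // owner=9 waiting=2
        System.out.println("woken=" + lock.release(9)); // woken=10
    }
}
```

In this model pid=10 gets the lock before pid=12 once pid=9 releases it, which matches the order seen in the logs here.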
In the end I finished the dummy procedure, and then we are stuck.
{noformat}
2018-05-27 20:33:19,232 INFO [PEWorker-3] procedure2.ProcedureExecutor(1265):
Finished pid=9, state=SUCCESS; DummyRegionProcedure table=test,
region=e2c647c382ead17faf1e4ff1d42800f5 in 1.2430sec
2018-05-27 20:33:19,235 INFO [PEWorker-10]
procedure.MasterProcedureScheduler(640): pid=10,
state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure
hri=e2c647c382ead17faf1e4ff1d42800f5, source=ubuntu,39829,1527424391444,
destination=ubuntu,32879,1527424391176 checking lock on
e2c647c382ead17faf1e4ff1d42800f5
2018-05-27 20:33:19,239 INFO [PEWorker-10] procedure2.ProcedureExecutor(1515):
Initialized subprocedures=[{pid=13, ppid=10,
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=test,
region=e2c647c382ead17faf1e4ff1d42800f5, server=ubuntu,39829,1527424391444}]
2018-05-27 20:33:19,256 INFO [PEWorker-10]
procedure.MasterProcedureScheduler(640): pid=13, ppid=10,
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=test,
region=e2c647c382ead17faf1e4ff1d42800f5, server=ubuntu,39829,1527424391444
checking lock on e2c647c382ead17faf1e4ff1d42800f5
2018-05-27 20:33:19,256 INFO [PEWorker-10] assignment.RegionStateStore(197):
pid=13 updating hbase:meta row=e2c647c382ead17faf1e4ff1d42800f5,
regionState=CLOSING, regionLocation=ubuntu,39829,1527424391444
2018-05-27 20:33:19,260 INFO [PEWorker-10]
assignment.RegionTransitionProcedure(251): Dispatch pid=13, ppid=10,
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=test,
region=e2c647c382ead17faf1e4ff1d42800f5, server=ubuntu,39829,1527424391444;
rit=CLOSING, location=ubuntu,39829,1527424391444
2018-05-27 20:33:19,260 WARN [PEWorker-10]
assignment.RegionTransitionProcedure(225): Remote call failed pid=13, ppid=10,
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=test,
region=e2c647c382ead17faf1e4ff1d42800f5, server=ubuntu,39829,1527424391444;
rit=CLOSING, location=ubuntu,39829,1527424391444; exception=pid=13, ppid=10,
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=test,
region=e2c647c382ead17faf1e4ff1d42800f5, server=ubuntu,39829,1527424391444 to
ubuntu,39829,1527424391444
2018-05-27 20:33:19,260 WARN [PEWorker-10] assignment.UnassignProcedure(281):
Expiring server pid=13, ppid=10, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
UnassignProcedure table=test, region=e2c647c382ead17faf1e4ff1d42800f5,
server=ubuntu,39829,1527424391444; rit=CLOSING,
location=ubuntu,39829,1527424391444,
exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException:
pid=13, ppid=10, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
table=test, region=e2c647c382ead17faf1e4ff1d42800f5,
server=ubuntu,39829,1527424391444 to ubuntu,39829,1527424391444
2018-05-27 20:33:19,260 WARN [PEWorker-10] master.ServerManager(580):
Expiration of ubuntu,39829,1527424391444 but server shutdown already in progress
{noformat}
> Reopen region while server crash can cause the procedure to be stuck
> --------------------------------------------------------------------
>
> Key: HBASE-20634
> URL: https://issues.apache.org/jira/browse/HBASE-20634
> Project: HBase
> Issue Type: Bug
> Reporter: Duo Zhang
> Priority: Critical
> Fix For: 3.0.0, 2.1.0, 2.0.1
>
> Attachments: HBASE-20634-UT.patch
>
>
> Found this when implementing HBASE-20424, where we transit the peer sync
> replication state while a server crash is in progress.
> The problem is that, in ServerCrashAssign, we do not hold the region lock. So
> after we call handleRIT to clear the existing assign/unassign procedures
> related to this rs, and before we schedule the assign procedures, it is
> possible that we schedule an unassign procedure for a region on the crashed
> rs. This procedure will not receive the ServerCrashException. Instead, in
> addToRemoteDispatcher, it will find that it cannot dispatch the remote call
> and a FailedRemoteDispatchException will be raised. But we do not treat this
> exception the same as ServerCrashException; we try to expire the rs instead.
> The rs has already been marked as expired, so this is almost a no-op, and the
> procedure will then be stuck there forever.
> A possible way to fix it is to treat FailedRemoteDispatchException the same
> as ServerCrashException, as it is created only in addToRemoteDispatcher, and
> the only reason we cannot dispatch a remote call is that the rs is already
> dead. The nodeMap is a ConcurrentMap so I think we could use it as a guard.
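
The fix proposed in the issue description above can be sketched roughly as follows. Everything in this block is a hypothetical stand-in: the two exception classes only mirror the names used in the issue, the Action enum and both methods are invented for illustration, and the fallback for other exception types is a simplification. It is not the real UnassignProcedure code.

```java
// Hypothetical sketch of the proposed handling (not the actual HBase classes).
final class RemoteCallFailureSketch {
    // Stand-ins that only reuse the exception names from the issue.
    static class ServerCrashException extends Exception {
        ServerCrashException(String msg) { super(msg); }
    }

    static class FailedRemoteDispatchException extends Exception {
        FailedRemoteDispatchException(String msg) { super(msg); }
    }

    // Invented labels for the two outcomes described in the issue.
    enum Action { HANDLE_AS_SERVER_CRASH, EXPIRE_SERVER }

    /** Behavior as described in the issue: only ServerCrashException is special. */
    static Action beforeFix(Exception e) {
        return (e instanceof ServerCrashException)
            ? Action.HANDLE_AS_SERVER_CRASH
            : Action.EXPIRE_SERVER;   // near no-op if the rs is already expired
    }

    /** Proposed behavior: a failed dispatch is treated like a server crash. */
    static Action afterFix(Exception e) {
        if (e instanceof ServerCrashException
                || e instanceof FailedRemoteDispatchException) {
            return Action.HANDLE_AS_SERVER_CRASH;
        }
        return Action.EXPIRE_SERVER;  // simplification for all other exceptions
    }

    public static void main(String[] args) {
        Exception e = new FailedRemoteDispatchException(
            "pid=13 to ubuntu,39829,1527424391444");
        System.out.println("before fix: " + beforeFix(e)); // before fix: EXPIRE_SERVER
        System.out.println("after fix:  " + afterFix(e));  // after fix:  HANDLE_AS_SERVER_CRASH
    }
}
```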