[ 
https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397969#comment-16397969
 ] 

Duo Zhang commented on HBASE-20152:
-----------------------------------

{quote}
RMP can deal w/ multiple runs in series. It will skip steps that don't need to 
be run (meta log splitting or reassign of meta). Only one RMP can run at a time 
(region lock).
{quote}

I think this will be a flaky part... RMP does not implement holdLock which 
means it is suspended, for example, when scheduling a sub procedure, then it 
will release the lock and another one can grab the lock...

Ideally, I prefer that, we should make meta online without depending on the 
procedure framework, just do the log splitting and assign in the master startup 
thread. Meta is special, the location is on zk so I think this is OK?

> [AMv2] DisableTableProcedure versus ServerCrashProcedure
> --------------------------------------------------------
>
>                 Key: HBASE-20152
>                 URL: https://issues.apache.org/jira/browse/HBASE-20152
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>
> Seeing a small spate of issues where disabled tables/regions are being 
> assigned. Usually they happen when a DisableTableProcedure is running 
> concurrent with a ServerCrashProcedure. See below. See associated 
> HBASE-20131. This is umbrella issue for fixing.
> h3. Deadlock
> From HBASE-20137, 'TestRSGroups is Flakey', 
> https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325
> {code}
>  * SCP is running because a server was aborted in test.
>  * SCP starts AssignProcedure of region X from crashed server.
>  * DisableTable Procedure runs because test has finished and we're doing 
> table delete. Queues 
>  * UnassignProcedure for region X.
>  * Disable Unassign gets Lock on region X first.
>  * SCP AssignProcedure tries to get lock, waits on lock.
>  * DisableTable Procedure UnassignProcedure RPC fails because server is down 
> (Thats why the SCP).
>  * Tries to expire the server it failed the RPC against. Fails (currently 
> being SCP'd).
>  * DisableTable Procedure Unassign is suspended. It is a suspend with lock on 
> region X held
>  * SCP can't run because lock on X is held
>  * Test timesout.
> {code}
> Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 
> is the disable table:
> {code}
> 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] 
> procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, 
> state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure 
> server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false
> 2018-03-05 11:29:21,300 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo
> 2018-03-05 11:29:21,307 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): 
> Client=jenkins//172.17.0.2 list rsgroup
> 2018-03-05 11:29:21,312 INFO  [Time-limited test] client.HBaseAdmin$15(901): 
> Started disable of Group_ns:testKillRS
> 2018-03-05 11:29:21,313 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS
> 2018-03-05 11:29:21,384 INFO  [PEWorker-9] 
> procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, 
> ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}]
> 2018-03-05 11:29:21,534 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> procedure2.ProcedureExecutor(865): Stored pid=309, 
> state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure 
> table=Group_ns:testKillRS
> 2018-03-05 11:29:21,542 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:21,644 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:21,847 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put 
> {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117}
> 2018-03-05 11:29:22,123 INFO  [PEWorker-5] hbase.MetaTableAccessor(1646): 
> Updated table Group_ns:testKillRS state to DISABLING in META
> 2018-03-05 11:29:22,148 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:22,345 INFO  [PEWorker-5] 
> procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=311, 
> ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524}]
> 2018-03-05 11:29:22,503 INFO  [PEWorker-13] 
> procedure.MasterProcedureScheduler(571): pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524, 
> Group_ns:testKillRS,,1520249357742.de7534c208a06502537cd95c248b3043.
> 2018-03-05 11:29:22,503 INFO  [PEWorker-5] 
> procedure.MasterProcedureScheduler(571): pid=310, ppid=308, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> Group_ns:testKillRS,,1520249357742.de7534c208a06502537cd95c248b3043.
> 2018-03-05 11:29:22,504 INFO  [PEWorker-13] assignment.RegionStateStore(189): 
> pid=311 updating hbase:meta 
> row=Group_ns:testKillRS,,1520249357742.de7534c208a06502537cd95c248b3043., 
> regionState=CLOSING, regionLocation=1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,504 DEBUG [PEWorker-5] 
> assignment.RegionTransitionProcedure(377): LOCK_EVENT_WAIT pid=310 
> serverLocks={}, namespaceLocks={}, tableLocks={}, 
> regionLocks={{de7534c208a06502537cd95c248b3043=exclusiveLockOwner=311, 
> sharedLockCount=0, waitingProcCount=1}}, peerLocks={}
> 2018-03-05 11:29:22,504 DEBUG [PEWorker-5] 
> procedure2.ProcedureExecutor(1230): LOCK_EVENT_WAIT pid=310, ppid=308, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043
> 2018-03-05 11:29:22,507 INFO  [PEWorker-13] 
> assignment.RegionTransitionProcedure(213): Dispatch pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] 
> assignment.RegionTransitionProcedure(187): Remote call failed pid=311, 
> ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524; exception=pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] 
> assignment.UnassignProcedure(276): Expiring server pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524, 
> exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException:
>  pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
> UnassignProcedure table=Group_ns:testKillRS, 
> region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> {code}
> It disable table procedure unassign and the SCP assign arrive at same time, 
> in same ms (2018-03-05 11:29:22,503) but the unassign manages to get the lock 
> first.
> h3. Delete of online Regions
> Saw this in nightly failure #452 for branch-2 in 
> TestSplitTransactionOnCluster.org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
> {code}
>  * DisableTableProcedure is queued before SCP.
>  * DisableTableProcedure Unassign fails because can't RPC to crashed server 
> and can't expire.
>  * Unassign is Stuck in suspend.
>  * SCP runs and cleans up suspended Disable Unassign.
>  * SCP completes which includes assign of Disable Unassign region.
>  * Disable Unassign completes
>  * Disable completes.
>  * A scheduled Drop Table Procedure runs (its end of test).
>  * Succeeds deleting regions that are actually assigned (see above where SCP 
> assigned region).
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to