[ 
https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393824#comment-16393824
 ] 

stack commented on HBASE-20152:
-------------------------------

For the deadlock scenario, we might be able to get away w/ simple fixes

 * We were re-assigning a region though there was an unassign outstanding 
against the crashed server. This was a bug that had us doing an unnecessary 
assign and upping likelihood of deadlock (unassign and concurrent assign).
 * Deadlock happens because the unassign happens AFTER SCP cancels outstanding 
RITs but before our SCP Assign is scheduled. Rerunning the RIT cancel procedure 
after all assigns have been created and scheduled hopefully takes care of the 
case where an outstanding unassign has the region lock so the SCP assign can't 
progress; the re-run of the cancel of outstanding RITs against crashed server 
will cancel any unassigns that might have been lucky enough to have been 
scheduled.

The above should fix scenario #1.

Scenario #2 'Delete of Online Region' is tough. It happened while we had the 
patch over in HBASE-20137 was in place which allowed the UnassignProcedure 
progress when it needed to wait on SCP. I won't spend more time on it.

For Scenario #3, 'Move after Cluster Shutdown', just need to retrofit 
Procedures so likes of Move, etc., don't run when cluster shutdown and Master 
stop has been set (probably don't want to turn off AMv2 Procedures completely 
since we'll likely want to put shutdown itself into a Procedure...)





> [AMv2] DisableTableProcedure versus ServerCrashProcedure
> --------------------------------------------------------
>
>                 Key: HBASE-20152
>                 URL: https://issues.apache.org/jira/browse/HBASE-20152
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>
> Seeing a small spate of issues where disabled tables/regions are being 
> assigned. Usually they happen when a DisableTableProcedure is running 
> concurrent with a ServerCrashProcedure. See below. See associated 
> HBASE-20131. This is umbrella issue for fixing.
> h3. Deadlock
> From HBASE-20137, 'TestRSGroups is Flakey', 
> https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325
> {code}
>  * SCP is running because a server was aborted in test.
>  * SCP starts AssignProcedure of region X from crashed server.
>  * DisableTable Procedure runs because test has finished and we're doing 
> table delete. Queues 
>  * UnassignProcedure for region X.
>  * Disable Unassign gets Lock on region X first.
>  * SCP AssignProcedure tries to get lock, waits on lock.
>  * DisableTable Procedure UnassignProcedure RPC fails because server is down 
> (Thats why the SCP).
>  * Tries to expire the server it failed the RPC against. Fails (currently 
> being SCP'd).
>  * DisableTable Procedure Unassign is suspended. It is a suspend with lock on 
> region X held
>  * SCP can't run because lock on X is held
>  * Test timesout.
> {code}
> Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 
> is the disable table:
> {code}
> 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] 
> procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, 
> state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure 
> server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false
> 2018-03-05 11:29:21,300 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo
> 2018-03-05 11:29:21,307 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): 
> Client=jenkins//172.17.0.2 list rsgroup
> 2018-03-05 11:29:21,312 INFO  [Time-limited test] client.HBaseAdmin$15(901): 
> Started disable of Group_ns:testKillRS
> 2018-03-05 11:29:21,313 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS
> 2018-03-05 11:29:21,384 INFO  [PEWorker-9] 
> procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, 
> ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}]
> 2018-03-05 11:29:21,534 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> procedure2.ProcedureExecutor(865): Stored pid=309, 
> state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure 
> table=Group_ns:testKillRS
> 2018-03-05 11:29:21,542 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:21,644 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:21,847 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put 
> {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117}
> 2018-03-05 11:29:22,123 INFO  [PEWorker-5] hbase.MetaTableAccessor(1646): 
> Updated table Group_ns:testKillRS state to DISABLING in META
> 2018-03-05 11:29:22,148 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:22,345 INFO  [PEWorker-5] 
> procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=311, 
> ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524}]
> 2018-03-05 11:29:22,503 INFO  [PEWorker-13] 
> procedure.MasterProcedureScheduler(571): pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524, 
> Group_ns:testKillRS,,1520249357742.de7534c208a06502537cd95c248b3043.
> 2018-03-05 11:29:22,503 INFO  [PEWorker-5] 
> procedure.MasterProcedureScheduler(571): pid=310, ppid=308, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> Group_ns:testKillRS,,1520249357742.de7534c208a06502537cd95c248b3043.
> 2018-03-05 11:29:22,504 INFO  [PEWorker-13] assignment.RegionStateStore(189): 
> pid=311 updating hbase:meta 
> row=Group_ns:testKillRS,,1520249357742.de7534c208a06502537cd95c248b3043., 
> regionState=CLOSING, regionLocation=1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,504 DEBUG [PEWorker-5] 
> assignment.RegionTransitionProcedure(377): LOCK_EVENT_WAIT pid=310 
> serverLocks={}, namespaceLocks={}, tableLocks={}, 
> regionLocks={{de7534c208a06502537cd95c248b3043=exclusiveLockOwner=311, 
> sharedLockCount=0, waitingProcCount=1}}, peerLocks={}
> 2018-03-05 11:29:22,504 DEBUG [PEWorker-5] 
> procedure2.ProcedureExecutor(1230): LOCK_EVENT_WAIT pid=310, ppid=308, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043
> 2018-03-05 11:29:22,507 INFO  [PEWorker-13] 
> assignment.RegionTransitionProcedure(213): Dispatch pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] 
> assignment.RegionTransitionProcedure(187): Remote call failed pid=311, 
> ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524; exception=pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] 
> assignment.UnassignProcedure(276): Expiring server pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524, 
> exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException:
>  pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
> UnassignProcedure table=Group_ns:testKillRS, 
> region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> {code}
> It disable table procedure unassign and the SCP assign arrive at same time, 
> in same ms (2018-03-05 11:29:22,503) but the unassign manages to get the lock 
> first.
> h3. Delete of online Regions
> Saw this in nightly failure #452 for branch-2 in 
> TestSplitTransactionOnCluster.org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
> {code}
>  * DisableTableProcedure is queued before SCP.
>  * DisableTableProcedure Unassign fails because can't RPC to crashed server 
> and can't expire.
>  * Unassign is Stuck in suspend.
>  * SCP runs and cleans up suspended Disable Unassign.
>  * SCP completes which includes assign of Disable Unassign region.
>  * Disable Unassign completes
>  * Disable completes.
>  * A scheduled Drop Table Procedure runs (its end of test).
>  * Succeeds deleting regions that are actually assigned (see above where SCP 
> assigned region).
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to