[
https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mike Drob updated HBASE-20152:
------------------------------
Issue Type: Umbrella (was: Bug)
> [AMv2] DisableTableProcedure versus ServerCrashProcedure
> --------------------------------------------------------
>
> Key: HBASE-20152
> URL: https://issues.apache.org/jira/browse/HBASE-20152
> Project: HBase
> Issue Type: Umbrella
> Components: amv2
> Reporter: stack
> Assignee: stack
> Priority: Major
>
> Seeing a small spate of issues where disabled tables/regions are being
> assigned. Usually they happen when a DisableTableProcedure is running
> concurrent with a ServerCrashProcedure. See below. See associated
> HBASE-20131. This is umbrella issue for fixing.
> h3. Deadlock
> From HBASE-20137, 'TestRSGroups is Flakey',
> https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325
> {code}
> * SCP is running because a server was aborted in test.
> * SCP starts AssignProcedure of region X from crashed server.
> * DisableTable Procedure runs because test has finished and we're doing
> table delete. Queues
> * UnassignProcedure for region X.
> * Disable Unassign gets Lock on region X first.
> * SCP AssignProcedure tries to get lock, waits on lock.
> * DisableTable Procedure UnassignProcedure RPC fails because server is down
> (Thats why the SCP).
> * Tries to expire the server it failed the RPC against. Fails (currently
> being SCP'd).
> * DisableTable Procedure Unassign is suspended. It is a suspend with lock on
> region X held
> * SCP can't run because lock on X is held
> * Test timesout.
> {code}
> Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309
> is the disable table:
> {code}
> 2018-03-05 11:29:21,224 DEBUG [PEWorker-7]
> procedure.ServerCrashProcedure(225): Done splitting WALs pid=308,
> state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure
> server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false
> 2018-03-05 11:29:21,300 INFO
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498]
> rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo
> 2018-03-05 11:29:21,307 INFO
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498]
> rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279):
> Client=jenkins//172.17.0.2 list rsgroup
> 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901):
> Started disable of Group_ns:testKillRS
> 2018-03-05 11:29:21,313 INFO
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498]
> master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS
> 2018-03-05 11:29:21,384 INFO [PEWorker-9]
> procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310,
> ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}]
> 2018-03-05 11:29:21,534 DEBUG
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498]
> procedure2.ProcedureExecutor(865): Stored pid=309,
> state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure
> table=Group_ns:testKillRS
> 2018-03-05 11:29:21,542 DEBUG
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498]
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:21,644 DEBUG
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498]
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:21,847 DEBUG
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498]
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put
> {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117}
> 2018-03-05 11:29:22,123 INFO [PEWorker-5] hbase.MetaTableAccessor(1646):
> Updated table Group_ns:testKillRS state to DISABLING in META
> 2018-03-05 11:29:22,148 DEBUG
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498]
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:22,345 INFO [PEWorker-5]
> procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=311,
> ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043,
> server=1cfd208ff882,40584,1520249102524}]
> 2018-03-05 11:29:22,503 INFO [PEWorker-13]
> procedure.MasterProcedureScheduler(571): pid=311, ppid=309,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043,
> server=1cfd208ff882,40584,1520249102524,
> Group_ns:testKillRS,,1520249357742.de7534c208a06502537cd95c248b3043.
> 2018-03-05 11:29:22,503 INFO [PEWorker-5]
> procedure.MasterProcedureScheduler(571): pid=310, ppid=308,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043,
> Group_ns:testKillRS,,1520249357742.de7534c208a06502537cd95c248b3043.
> 2018-03-05 11:29:22,504 INFO [PEWorker-13] assignment.RegionStateStore(189):
> pid=311 updating hbase:meta
> row=Group_ns:testKillRS,,1520249357742.de7534c208a06502537cd95c248b3043.,
> regionState=CLOSING, regionLocation=1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,504 DEBUG [PEWorker-5]
> assignment.RegionTransitionProcedure(377): LOCK_EVENT_WAIT pid=310
> serverLocks={}, namespaceLocks={}, tableLocks={},
> regionLocks={{de7534c208a06502537cd95c248b3043=exclusiveLockOwner=311,
> sharedLockCount=0, waitingProcCount=1}}, peerLocks={}
> 2018-03-05 11:29:22,504 DEBUG [PEWorker-5]
> procedure2.ProcedureExecutor(1230): LOCK_EVENT_WAIT pid=310, ppid=308,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043
> 2018-03-05 11:29:22,507 INFO [PEWorker-13]
> assignment.RegionTransitionProcedure(213): Dispatch pid=311, ppid=309,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043,
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING,
> location=1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN [PEWorker-13]
> assignment.RegionTransitionProcedure(187): Remote call failed pid=311,
> ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043,
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING,
> location=1cfd208ff882,40584,1520249102524; exception=pid=311, ppid=309,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043,
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN [PEWorker-13]
> assignment.UnassignProcedure(276): Expiring server pid=311, ppid=309,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043,
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING,
> location=1cfd208ff882,40584,1520249102524,
> exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException:
> pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
> UnassignProcedure table=Group_ns:testKillRS,
> region=de7534c208a06502537cd95c248b3043,
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> {code}
> It disable table procedure unassign and the SCP assign arrive at same time,
> in same ms (2018-03-05 11:29:22,503) but the unassign manages to get the lock
> first.
> h3. Delete of online Regions
> Saw this in nightly failure #452 for branch-2 in
> TestSplitTransactionOnCluster.org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
> {code}
> * DisableTableProcedure is queued before SCP.
> * DisableTableProcedure Unassign fails because can't RPC to crashed server
> and can't expire.
> * Unassign is Stuck in suspend.
> * SCP runs and cleans up suspended Disable Unassign.
> * SCP completes which includes assign of Disable Unassign region.
> * Disable Unassign completes
> * Disable completes.
> * A scheduled Drop Table Procedure runs (its end of test).
> * Succeeds deleting regions that are actually assigned (see above where SCP
> assigned region).
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)