[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure

stack (JIRA) Thu, 08 Mar 2018 00:19:16 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16390901#comment-16390901
 ]


stack commented on HBASE-20152:
-------------------------------

New scenario seen on an internal cluster (not sure what the test is doing... 
looks like crazy killing of the cluster followed by restarts).

h3. Procedure after Cluster Shutdown
{code}
 * Cluster shutdown is set. This means cluster down flag is set, we expire 
servers, but no ServerCrashProcedure gets scheduled.
  2018-03-03 05:51:33,852 INFO org.apache.hadoop.hbase.master.ServerManager: 
Cluster shutdown set; quasar-lxosrm-3.vpc.cloudera.com,22101,1520080141368 
expired; onlineServers=2
  2018-03-03 05:51:33,859 INFO 
org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node 
deleted, processing expiration 
[quasar-lxosrm-2.vpc.cloudera.com,22101,1520080154394]
 * Just after, a Move Region event comes in
  2018-03-03 05:51:33,999 INFO 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: pid=158, 
state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
hri=hbase:namespace,,1520062075237.d2854c6a96bb191d04a3f285c0eef210., 
source=quasar-lxosrm-2.vpc.cloudera.com,22101,1520080154394, destination= 
hbase:namespace hbase:namespace,,1520062075237.d2854c6a96bb191d04a3f285c0eef210.
 * Move region tries to unassign but the RPC fails with java.io.IOException: Call to 
quasar-lxosrm-2.vpc.cloudera.com/172.26.11.155:22101 failed on local exception: 
org.apache.hadoop.hbase.ipc.StoppedRpcClientException
 * Can't schedule an SCP. Fails because the server is already going down and is 
not present as online. Procedure is suspended.
 * New Master comes up and the Unassign procedure is rerun. It dispatches to the 
non-existent server.
 * Again fails to schedule an SCP because the server is long gone.
 * Procedure is suspended. Stuck forever.
{code}
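
The sequence above can be sketched as a toy model. This is only an illustration of the bookkeeping as I read it from the logs, not the actual master code: {{ToyServerManager}}, {{expireServer}}, {{unassignOnce}} and {{willBeWoken}} are made-up names, and the assumption is that a suspended procedure is only ever woken by a crash procedure for the same server.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only. None of these types or methods are HBase APIs.
final class StuckUnassignSketch {
  enum State { FINISHED, SUSPENDED }

  /** Hypothetical stand-in for the master's view of live servers. */
  static final class ToyServerManager {
    final Set<String> onlineServers = new HashSet<>();
    boolean clusterShutdown;
    int crashProceduresScheduled;

    /** Expires a server; returns true only if crash handling was scheduled. */
    boolean expireServer(String server) {
      boolean wasOnline = onlineServers.remove(server);
      if (!wasOnline || clusterShutdown) {
        // Unknown server, or cluster-down flag set: nothing is scheduled.
        return false;
      }
      crashProceduresScheduled++;
      return true;
    }
  }

  /** One attempt at the unassign step of a region move. */
  static State unassignOnce(ToyServerManager sm, String server, boolean rpcSucceeds) {
    if (rpcSucceeds) {
      return State.FINISHED;
    }
    // RPC failed: try to expire the server, then suspend until crash handling
    // wakes us. The result of expireServer is ignored, which is the hazard.
    sm.expireServer(server);
    return State.SUSPENDED;
  }

  /** Assumption: only a scheduled crash procedure can wake the suspended step. */
  static boolean willBeWoken(ToyServerManager sm) {
    return sm.crashProceduresScheduled > 0;
  }

  public static void main(String[] args) {
    // Old master: shutdown flag set, server expired without a crash procedure.
    ToyServerManager oldMaster = new ToyServerManager();
    oldMaster.onlineServers.add("rs2");
    oldMaster.clusterShutdown = true;
    oldMaster.expireServer("rs2");
    State first = unassignOnce(oldMaster, "rs2", false);
    System.out.println("old master: " + first + ", woken=" + willBeWoken(oldMaster));

    // New master: the server was never online from its point of view.
    ToyServerManager newMaster = new ToyServerManager();
    State second = unassignOnce(newMaster, "rs2", false);
    System.out.println("new master: " + second + ", woken=" + willBeWoken(newMaster));
  }
}
```

In this model both attempts end suspended with nothing left to wake them, which is the "stuck forever" state. It suggests the suspend path should probably check whether the expire actually scheduled anything before going to sleep.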

> [AMv2] DisableTableProcedure versus ServerCrashProcedure
> --------------------------------------------------------
>
>                 Key: HBASE-20152
>                 URL: https://issues.apache.org/jira/browse/HBASE-20152
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>
> Seeing a small spate of issues where disabled tables/regions are being 
> assigned. Usually they happen when a DisableTableProcedure is running 
> concurrently with a ServerCrashProcedure. See below. See associated 
> HBASE-20131. This is the umbrella issue for fixing them.
> h3. Deadlock
> From HBASE-20137, 'TestRSGroups is Flakey', 
> https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325
> {code}
>  * SCP is running because a server was aborted in test.
>  * SCP starts AssignProcedure of region X from crashed server.
>  * DisableTable Procedure runs because test has finished and we're doing 
> table delete. Queues UnassignProcedure for region X.
>  * Disable Unassign gets Lock on region X first.
>  * SCP AssignProcedure tries to get lock, waits on lock.
>  * DisableTable Procedure UnassignProcedure RPC fails because server is down 
> (That's why the SCP is running).
>  * Tries to expire the server it failed the RPC against. Fails (currently 
> being SCP'd).
>  * DisableTable Procedure Unassign is suspended. It is suspended with the 
> lock on region X still held.
>  * SCP can't run because lock on X is held
>  * Test times out.
> {code}
> h3. Delete of online Regions
> Saw this in nightly failure #452 for branch-2 in 
> TestSplitTransactionOnCluster.org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
> {code}
>  * DisableTableProcedure is queued before SCP.
>  * DisableTableProcedure Unassign fails because can't RPC to crashed server 
> and can't expire.
>  * Unassign is Stuck in suspend.
>  * SCP runs and cleans up suspended Disable Unassign.
>  * SCP completes which includes assign of Disable Unassign region.
>  * Disable Unassign completes
>  * Disable completes.
>  * A scheduled Drop Table Procedure runs (it's the end of the test).
>  * Succeeds deleting regions that are actually assigned (see above where SCP 
> assigned region).
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
