[ 
https://issues.apache.org/jira/browse/HBASE-23011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935353#comment-16935353
 ] 

Josh Elser commented on HBASE-23011:
------------------------------------

The user did eventually come back. The bypass (on the AP, recursive to bypass 
the whole SCP) did work, but there were a bunch of other queued CTPs that 
didn't clear out on their own.

Before I could figure out why the system was hung, the user just wiped 
everything. Closing this as I don't think we can make any progress on it.

> AP stuck in retry loop if underlying table no longer exists
> -----------------------------------------------------------
>
>                 Key: HBASE-23011
>                 URL: https://issues.apache.org/jira/browse/HBASE-23011
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.6, 2.1.6
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Major
>         Attachments: HBASE-23011.001.branch-2.1.patch
>
>
> Looking at a user's issue with [~wchevreuil]... While the details of how 
> exactly we got into this situation are murky, I'm noticing that an AP can 
> get stuck resubmitting itself over and over if, somehow, the table that 
> owns the region being assigned gets deleted.
> {noformat}
> 2019-08-25 23:33:54,588 WARN  [PEWorker-11] assignment.RegionTransitionProcedure: Failed transition, suspend 1secs pid=1100250, ppid=1100195, state=RUNNABLE:REGION_TRANSITION_QUEUE, locked=true; AssignProcedure table=<tablename>, region=<regionid>; rit=OFFLINE, location=null; waiting on rectified condition fixed by other Procedure or operator intervention
> org.apache.hadoop.hbase.master.TableStateManager$TableStateNotFoundException: monitoring:test1
>       at org.apache.hadoop.hbase.master.TableStateManager.getTableState(TableStateManager.java:215)
>       at org.apache.hadoop.hbase.master.assignment.AssignProcedure.assign(AssignProcedure.java:195)
>       at org.apache.hadoop.hbase.master.assignment.AssignProcedure.startTransition(AssignProcedure.java:206)
>       at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:364)
>       at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:98)
>       at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:958)
>       at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1836)
>       at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1596)
>       at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:80)
>       at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2141)
>  {noformat}
> The stack trace looks similar to the above.
> The problem appears to be that we don't catch the 
> {{TableStateNotFoundException}} coming out of 
> {{TableStateManager#getTableState(TableName)}}. This keeps the AP in a 
> fail/resubmit loop (until, presumably, someone comes along with an {{HBCK2 
> bypass}}). This is only a problem in branch-2.0 and branch-2.1. 
> {{TransitRegionStateProcedure}} in branch-2.2+ doesn't have the same issue 
> (at least on the surface).
> As mentioned earlier, it's not clear how we got this 
> SCP(1100195)->AP(1100250) scheduled while the table itself is actually 
> deleted. Some quick attempts to reproduce this locally weren't successful. 
> I'm not sure if I can write a meaningful test. Need to try to look more 
> closely at that, but will attach a patch which I think will work around the 
> issue.
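
The fail/resubmit loop described in the quoted report can be illustrated with a small standalone sketch. This is not HBase code and not necessarily what the attached patch does: RetryLoopSketch, TableStates, Outcome, assignUnguarded and assignGuarded are hypothetical names, the exception class is a simplified stand-in for the real one, and maxAttempts exists only so the demo terminates.

```java
import java.util.HashMap;
import java.util.Map;

public class RetryLoopSketch {
    /** Hypothetical stand-in for TableStateManager.TableStateNotFoundException. */
    static class TableStateNotFoundException extends Exception {
        TableStateNotFoundException(String table) { super(table); }
    }

    /** Hypothetical, heavily simplified table-state lookup. */
    static class TableStates {
        private final Map<String, String> states = new HashMap<>();

        void put(String table, String state) { states.put(table, state); }

        String getTableState(String table) throws TableStateNotFoundException {
            String state = states.get(table);
            if (state == null) {
                throw new TableStateNotFoundException(table);
            }
            return state;
        }
    }

    enum Outcome { ASSIGNED, ABORTED_TABLE_MISSING, STILL_RETRYING }

    /** Treats every failure as transient, so a missing table is retried on every attempt. */
    static Outcome assignUnguarded(TableStates ts, String table, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                ts.getTableState(table);
                return Outcome.ASSIGNED;
            } catch (Exception e) {
                // Corresponds to "Failed transition, suspend ..." followed by a resubmit.
            }
        }
        return Outcome.STILL_RETRYING;
    }

    /** One possible guard (an assumption): a missing table is terminal, not transient. */
    static Outcome assignGuarded(TableStates ts, String table, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                ts.getTableState(table);
                return Outcome.ASSIGNED;
            } catch (TableStateNotFoundException e) {
                // The table is gone, so retrying cannot succeed; give up instead.
                return Outcome.ABORTED_TABLE_MISSING;
            } catch (Exception e) {
                // Any other failure is still treated as transient and retried.
            }
        }
        return Outcome.STILL_RETRYING;
    }

    public static void main(String[] args) {
        TableStates ts = new TableStates();
        ts.put("present", "ENABLED");
        System.out.println(assignUnguarded(ts, "monitoring:test1", 5));
        System.out.println(assignGuarded(ts, "monitoring:test1", 5));
        System.out.println(assignGuarded(ts, "present", 5));
    }
}
```

The only difference between the two methods is the dedicated catch clause. Without it, the missing-table case is indistinguishable from a transient failure, so the procedure is resubmitted indefinitely.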



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
