Josh Elser created HBASE-23011:
----------------------------------
Summary: AP stuck in retry loop if underlying table no longer exists
Key: HBASE-23011
URL: https://issues.apache.org/jira/browse/HBASE-23011
Project: HBase
Issue Type: Bug
Affects Versions: 2.1.6, 2.0.6
Reporter: Josh Elser
Assignee: Josh Elser
Looking at a user's issue with [~wchevreuil]... While the details of how
exactly we got into this situation are murky, I'm noticing that an AP
(AssignProcedure) can get stuck resubmitting itself over and over if,
somehow, the table that owns the region being assigned gets deleted.
{noformat}
2019-08-25 23:33:54,588 WARN [PEWorker-11] assignment.RegionTransitionProcedure: Failed transition, suspend 1secs pid=1100250, ppid=1100195, state=RUNNABLE:REGION_TRANSITION_QUEUE, locked=true; AssignProcedure table=<tablename>, region=<regionid>; rit=OFFLINE, location=null; waiting on rectified condition fixed by other Procedure or operator intervention
org.apache.hadoop.hbase.master.TableStateManager$TableStateNotFoundException: monitoring:test1
    at org.apache.hadoop.hbase.master.TableStateManager.getTableState(TableStateManager.java:215)
    at org.apache.hadoop.hbase.master.assignment.AssignProcedure.assign(AssignProcedure.java:195)
    at org.apache.hadoop.hbase.master.assignment.AssignProcedure.startTransition(AssignProcedure.java:206)
    at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:364)
    at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:98)
    at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:958)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1836)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1596)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:80)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2141)
{noformat}
The stack trace looks similar to the above.
The problem appears to be that we don't catch the
{{TableStateNotFoundException}} coming out of
{{TableStateManager#getTableState(TableName)}}. This keeps the AP in a
fail/resubmit loop (until, presumably, someone comes along with an {{HBCK2
bypass}}). This is only a problem in branch-2.0 and branch-2.1.
{{TransitRegionStateProcedure}} in branch-2.2+ doesn't have the same issue (at
least on the surface).
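To make the failure mode concrete, here is a rough, self-contained sketch of the pattern. It is not HBase code and not the patch for this issue: every class and method below ({{AssignRetrySketch}}, {{TableStates}}, {{attemptUnguarded}}, {{attemptGuarded}}, {{run}}) is a hypothetical stand-in, and the retry handling is heavily simplified. It only illustrates the difference between treating a missing table as a transient failure and treating it as a terminal one.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-ins only; none of these are the real HBase classes.
class AssignRetrySketch {

    /** Stand-in for TableStateManager.TableStateNotFoundException. */
    static class TableStateNotFoundException extends Exception {
        TableStateNotFoundException(String table) {
            super(table);
        }
    }

    /** Stand-in for the table state lookup: it only knows which tables exist. */
    static class TableStates {
        private final Set<String> tables = new HashSet<>();

        void add(String table) {
            tables.add(table);
        }

        void checkExists(String table) throws TableStateNotFoundException {
            if (!tables.contains(table)) {
                throw new TableStateNotFoundException(table);
            }
        }
    }

    enum Outcome { ASSIGNED, RETRY, ABORTED }

    /** Unguarded: any failure, including a missing table, is treated as transient. */
    static Outcome attemptUnguarded(TableStates states, String table) {
        try {
            states.checkExists(table);
            return Outcome.ASSIGNED;
        } catch (Exception e) {
            return Outcome.RETRY;
        }
    }

    /** Guarded: a missing table is treated as terminal, so the caller stops retrying. */
    static Outcome attemptGuarded(TableStates states, String table) {
        try {
            states.checkExists(table);
            return Outcome.ASSIGNED;
        } catch (TableStateNotFoundException e) {
            return Outcome.ABORTED;
        }
    }

    /** Repeats attempts until a non-RETRY outcome or the cap; returns attempts made. */
    static int run(boolean guarded, TableStates states, String table, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            Outcome outcome = guarded
                ? attemptGuarded(states, table)
                : attemptUnguarded(states, table);
            if (outcome != Outcome.RETRY) {
                break;
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        // The table was deleted, so it is absent from the known states.
        TableStates states = new TableStates();
        System.out.println("unguarded attempts: " + run(false, states, "monitoring:test1", 10));
        System.out.println("guarded attempts: " + run(true, states, "monitoring:test1", 10));
    }
}
```

With the table absent, the unguarded variant uses up all 10 attempts (the cap exists only so the sketch terminates; the real loop has no such cap), while the guarded variant stops after the first attempt.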
As mentioned earlier, it's not clear how we got this SCP(1100195)->AP(1100250)
scheduled while the table itself is actually deleted. Some quick attempts to
reproduce this locally weren't successful. I'm not sure if I can write a
meaningful test. Need to try to look more closely at that, but will attach a
patch which I think will work around the issue.