[ https://issues.apache.org/jira/browse/HBASE-23011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Elser updated HBASE-23011:
-------------------------------
    Attachment: HBASE-23011.001.branch-2.1.patch

> AP stuck in retry loop if underlying table no longer exists
> -----------------------------------------------------------
>
>                 Key: HBASE-23011
>                 URL: https://issues.apache.org/jira/browse/HBASE-23011
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.6, 2.1.6
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Major
>         Attachments: HBASE-23011.001.branch-2.1.patch
>
>
> Looking at a user's issue with [~wchevreuil]... While the details of how
> exactly we got into this situation are murky, I'm noticing that an AP can
> get stuck resubmitting itself over and over if, somehow, the table that
> owns the region the AP is assigning gets deleted.
> {noformat}
> 2019-08-25 23:33:54,588 WARN [PEWorker-11] assignment.RegionTransitionProcedure: Failed transition, suspend 1secs pid=1100250, ppid=1100195, state=RUNNABLE:REGION_TRANSITION_QUEUE, locked=true; AssignProcedure table=<tablename>, region=<regionid>; rit=OFFLINE, location=null; waiting on rectified condition fixed by other Procedure or operator intervention
> org.apache.hadoop.hbase.master.TableStateManager$TableStateNotFoundException: monitoring:test1
>         at org.apache.hadoop.hbase.master.TableStateManager.getTableState(TableStateManager.java:215)
>         at org.apache.hadoop.hbase.master.assignment.AssignProcedure.assign(AssignProcedure.java:195)
>         at org.apache.hadoop.hbase.master.assignment.AssignProcedure.startTransition(AssignProcedure.java:206)
>         at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:364)
>         at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:98)
>         at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:958)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1836)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1596)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:80)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2141)
> {noformat}
> The stack trace looks similar to the above.
> The problem appears to be that we don't catch the
> {{TableStateNotFoundException}} coming out of
> {{TableStateManager#getTableState(TableName)}}. This keeps the AP in a
> fail/resubmit loop (until, presumably, someone comes along with an `HBCK2
> bypass`). This is only a problem in branch-2.0 and branch-2.1.
> {{TransitRegionStateProcedure}} in branch-2.2+ doesn't have the same issue
> (at least on the surface).
> As mentioned earlier, it's not clear how we got this
> SCP(1100195)->AP(1100250) scheduled while the table itself is actually
> deleted. Some quick attempts to reproduce this locally weren't successful.
> I'm not sure if I can write a meaningful test. I need to look more closely
> at that, but will attach a patch which I think will work around the issue.
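To make the fail/resubmit loop described above concrete, here is a minimal, self-contained sketch. It does not use HBase classes and is not taken from the attached patch: AssignRetrySketch, TableStates, Outcome, stepUnpatched, and stepPatched are all hypothetical stand-ins. It only illustrates the idea that a missing table state, if handled like any other failure, is retried forever, whereas catching it explicitly lets the procedure stop.

```java
import java.util.HashSet;
import java.util.Set;

public class AssignRetrySketch {
    /** Simplified stand-in for TableStateManager.TableStateNotFoundException. */
    static class TableStateNotFoundException extends Exception {
        TableStateNotFoundException(String table) { super(table); }
    }

    /** Hypothetical stand-in for a table state lookup. */
    static class TableStates {
        private final Set<String> known = new HashSet<>();
        void add(String table) { known.add(table); }
        String getState(String table) throws TableStateNotFoundException {
            if (!known.contains(table)) {
                throw new TableStateNotFoundException(table);
            }
            return "ENABLED";
        }
    }

    enum Outcome { ASSIGNED, RETRY, ABORTED_TABLE_MISSING }

    /** Models the reported behavior: every failure means "suspend and resubmit". */
    static Outcome stepUnpatched(TableStates states, String table) {
        try {
            states.getState(table);
            return Outcome.ASSIGNED;
        } catch (Exception e) {
            return Outcome.RETRY;
        }
    }

    /** Assumed workaround: a missing table state ends the procedure instead of retrying. */
    static Outcome stepPatched(TableStates states, String table) {
        try {
            states.getState(table);
            return Outcome.ASSIGNED;
        } catch (TableStateNotFoundException e) {
            return Outcome.ABORTED_TABLE_MISSING;
        }
    }

    /** Runs one step repeatedly until it stops asking for a retry or maxAttempts is reached. */
    static int attemptsUntilDone(boolean patched, TableStates states, String table, int maxAttempts) {
        int attempts = 0;
        Outcome outcome;
        do {
            attempts++;
            outcome = patched ? stepPatched(states, table) : stepUnpatched(states, table);
        } while (outcome == Outcome.RETRY && attempts < maxAttempts);
        return attempts;
    }

    public static void main(String[] args) {
        TableStates states = new TableStates(); // "monitoring:test1" was never added, i.e. deleted
        System.out.println(attemptsUntilDone(false, states, "monitoring:test1", 5)); // prints 5
        System.out.println(attemptsUntilDone(true, states, "monitoring:test1", 5));  // prints 1
    }
}
```

In the unpatched variant the loop only ends because the sketch caps the attempts; the real procedure has no such cap, which is why operator intervention is needed.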