[ 
https://issues.apache.org/jira/browse/HBASE-20706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511719#comment-16511719
 ] 

Josh Elser commented on HBASE-20706:
------------------------------------

Talked to Stack offline, he asked me for a higher-level summary of how we 
didn't catch this "category" of issues with MTP thus far.

There have been two situations which have exacerbated this problem among others:
 # back-to-back rolling restarts
 ## On test finishes, we restart the cluster to undo config changes. Then the 
next test starts, and we restart the cluster again to set the necessary config. 
Something about these steps are enough to catch a non-OPEN region.
 # repeated table modifications
 ## Ambari metrics system uses phoenix. When the daemon that uses phoenix 
starts, it will submit some calls to modify the table configuration. I think 
something to do with a modify close to a restart triggered this.

It might be a good idea to back through our chaos monkey and expand it with 
some table modification steps.

> [hack] Don't add known not-OPEN regions in reopen phase of MTP
> --------------------------------------------------------------
>
>                 Key: HBASE-20706
>                 URL: https://issues.apache.org/jira/browse/HBASE-20706
>             Project: HBase
>          Issue Type: Sub-task
>          Components: amv2
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Critical
>             Fix For: 3.0.0, 2.1.0, 2.0.1
>
>         Attachments: HBASE-20706.001.branch-2.0.patch, 
> HBASE-20706.002.branch-2.0.patch
>
>
> Shake-down of ModifyTableProcedure, talked this one out with Stack – "proper" 
> fix is likely pending in HBASE-20682. Using MoveRegionProcedure is likely the 
> wrong construct, we would want something specific to reopen (e.g. a 
> ReopenProcedure).
> However, we're in a really bad state right now. If there are non-open regions 
> for a table which has a modify submitted against it, the entire system locks 
> up in a fast-spin while holding the table's lock. This fills up HDFS with PV2 
> wals, and prevents you from doing anything in the hbase shell to try to fix 
> those unassigned regions. You'll see spam in the master log like:
> {noformat}
> 2018-06-07 03:21:29,448 WARN  [PEWorker-1] procedure.ModifyTableProcedure: 
> Retriable error trying to modify table=METRIC_RECORD_HOURLY_UUID (in 
> state=MODIFY_TABLE_REOPEN_ALL_REGIONS)
> org.apache.hadoop.hbase.client.DoNotRetryRegionException: 
> a3dc333606d38aeb6e2ab4b94233cfbc is not OPEN
>         at 
> org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:193)
>         at 
> org.apache.hadoop.hbase.master.assignment.MoveRegionProcedure.<init>(MoveRegionProcedure.java:67)
>         at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.createMoveRegionProcedure(AssignmentManager.java:767)
>         at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.createReopenProcedures(AssignmentManager.java:705)
>         at 
> org.apache.hadoop.hbase.master.procedure.ModifyTableProcedure.executeFromState(ModifyTableProcedure.java:128)
>         at 
> org.apache.hadoop.hbase.master.procedure.ModifyTableProcedure.executeFromState(ModifyTableProcedure.java:50)
>         at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:184)
>         at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:850)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1472)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1240)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1760)
> {noformat}
> We unstuck out internal test cluster giving the following change on top of 
> Sergey's HBASE-20657. When choosing the regions to reopen, if we filter out a 
> table's regions to only be those which are currently OPEN. There may be some 
> transient failures here as well, but a subsequent retry of the reopen step 
> should filter out that change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to