Ray Mattingly created HBASE-27975:
-------------------------------------

             Summary: Region (un)assignment should have a more direct timeout
                 Key: HBASE-27975
                 URL: https://issues.apache.org/jira/browse/HBASE-27975
             Project: HBase
          Issue Type: Improvement
            Reporter: Ray Mattingly


h3. Problem

We've observed a few cases in which region (un)assignment can hang for 
significant, and sometimes seemingly indefinite, periods of time. This results 
in unpredictably long downtime which must be remediated via manually initiated 
ServerCrashProcedures.
h3. Example 1

If a RegionServer (RS) cannot communicate with the NameNode (NN) when it is 
asked to close a region, its RS_CLOSE_REGION thread gets stuck awaiting an NN 
failover. With the default values of options such as:
 * hbase.hstore.flush.retries.number
 * hbase.server.pause
 * dfs.client.failover.max.attempts
 * dfs.client.failover.sleep.base.millis

this region unassignment attempt will hang for approximately 30 minutes before 
it allows the failure to bubble up and automatically trigger a 
ServerCrashProcedure.

One can tune these options to reduce the time to recovery (TTR), but that is 
neither an obvious nor a direct solution.
h3. Example 2

In rare cases our public cloud provider supplies us with machines that have 
degraded hardware. If we fail to catch the degradation before startup, the 
degraded RegionServer process may still come online and be assigned regions 
that it can often never open successfully. If the RegionServer's assignment 
handling hangs rather than failing explicitly, nothing intervenes from outside 
and the assignment stays stuck indefinitely. I've written [a unit 
test|https://github.com/apache/hbase/compare/master...HubSpot:hbase:rsit-opening-repro]
 which reproduces this behavior. The same branch includes a unit test 
demonstrating that a timeout placed on the AssignRegionHandler helps to fail 
fast and reliably trigger the necessary ServerCrashProcedure.
h3. Proposal

I propose that we add optional, configurable timeouts to the AssignRegion and 
UnassignRegion event handlers.

This would let us bound long-running retries for these downtime-inducing 
procedures deliberately and explicitly, and could consequently improve our 
reliability in both examples.
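
As a rough illustration of the idea (not the actual handler code), the sketch 
below bounds a unit of work with a timeout using only java.util.concurrent. 
The class name, the Outcome enum, and the runWithTimeout helper are all 
hypothetical; a real change would live inside the AssignRegion and 
UnassignRegion handlers and read the timeout from configuration. A timeout of 
zero or less is treated as "disabled", which is one possible way to keep the 
timeout optional.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: none of these names exist in HBase.
public class HandlerTimeoutSketch {

  /** Result of one bounded attempt at the handler's work. */
  public enum Outcome { COMPLETED, FAILED, TIMED_OUT }

  /**
   * Runs the given work on a separate daemon thread and waits for it.
   * If timeoutMs is positive, the wait is bounded; otherwise the wait is
   * unbounded, which mirrors the current behavior with no timeout.
   */
  public static Outcome runWithTimeout(Callable<Void> work, long timeoutMs)
      throws InterruptedException {
    ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "handler-timeout-sketch");
      t.setDaemon(true); // do not keep the JVM alive for a stuck task
      return t;
    });
    Future<Void> future = executor.submit(work);
    try {
      if (timeoutMs > 0) {
        future.get(timeoutMs, TimeUnit.MILLISECONDS);
      } else {
        future.get();
      }
      return Outcome.COMPLETED;
    } catch (TimeoutException e) {
      // The work is still running: interrupt it and report the timeout so
      // that the caller can fail fast instead of retrying indefinitely.
      future.cancel(true);
      return Outcome.TIMED_OUT;
    } catch (ExecutionException e) {
      // The work itself threw; surface that as an ordinary failure.
      return Outcome.FAILED;
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // A task that finishes immediately.
    System.out.println(runWithTimeout(() -> null, 1_000L));

    // A task that stands in for a hung region open or close.
    System.out.println(runWithTimeout(() -> {
      Thread.sleep(60_000L);
      return null;
    }, 100L));
  }
}
```

In this sketch the caller decides what a TIMED_OUT outcome means. For the two 
examples above, the assumption is that the RegionServer would treat it as 
fatal so that a ServerCrashProcedure follows without manual intervention.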



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
