Stephen Yuan Jiang created HBASE-12464:
------------------------------------------

             Summary: meta table region assignment stuck in the FAILED_OPEN 
state due to region server not fully ready to serve
                 Key: HBASE-12464
                 URL: https://issues.apache.org/jira/browse/HBASE-12464
             Project: HBase
          Issue Type: Bug
          Components: Region Assignment
    Affects Versions: 0.99.1, 1.0.0, 2.0.0
            Reporter: Stephen Yuan Jiang
            Assignee: Stephen Yuan Jiang
             Fix For: 1.0.0, 2.0.0


Assignment of a meta table region can end up in the 'FAILED_OPEN' state, which 
leaves the region unavailable until the target region server shuts down or an 
operator resolves it manually.  This is an undesirable state for a meta table 
region.



Here is the sequence in which this can happen (the code is in 
AssignmentManager::assign()):

Step 1: Master detects that a region server (RS1) hosting a meta table region 
is down and changes the meta region state from 'online' to 'offline'.

Step 2: In a loop (with a configurable maximumAttempts count; the default is 10 
and the minimum is 1), AssignmentManager tries to find an RS to host the meta 
table region.  If no RS is available, it is meant to loop forever by resetting 
the loop count (!!BUG#1 in this logic - a small bug!!):

           if (region.isMetaRegion()) {
-            try {
-              Thread.sleep(this.sleepTimeBeforeRetryingMetaAssignment);
-              if (i == maximumAttempts) i = 1; // ==> BUG: if maximumAttempts is 1, then the loop will end.
-              continue;
-            } catch (InterruptedException e) {
-              ...
-            }
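
To make BUG#1 concrete, here is a minimal standalone sketch of the counter 
arithmetic.  It assumes the enclosing loop has the shape 
for (int i = 1; i <= maximumAttempts; i++), which is not shown in the snippet 
above; the class name and the resetTo and cap parameters are hypothetical and 
exist only for illustration.

```java
final class RetryResetDemo {
    /**
     * Counts how many times the loop body runs before the loop exits on its
     * own. `cap` is a safety limit standing in for "loops forever".
     * Assumption: the real loop header is
     * for (int i = 1; i <= maximumAttempts; i++).
     */
    static int iterations(int maximumAttempts, int resetTo, int cap) {
        int iterations = 0;
        for (int i = 1; i <= maximumAttempts; i++) {
            iterations++;
            if (iterations >= cap) {
                break; // the loop would have kept going
            }
            // resetTo == 1 mirrors the line quoted in the ticket
            if (i == maximumAttempts) i = resetTo;
            continue; // the for-loop's i++ runs next
        }
        return iterations;
    }
}
```

With resetTo = 1 and maximumAttempts = 1, the i++ after the reset pushes i to 2 
and the loop exits after a single pass.  With maximumAttempts = 10 the same 
reset keeps the loop alive.  Resetting to 0 instead is one possible correction 
that also covers the maximumAttempts = 1 case.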

Step 3: Once a new RS is found (RS2), inside the same loop as Step 2, 
AssignmentManager tries to assign the meta region to RS2 (OFFLINE, RS1 => 
PENDING_OPEN, RS2).  If opening the region on RS2 fails for some reason 
(e.g. the target RS2 is not ready to serve - ServerNotRunningYetException), 
AssignmentManager changes the state from (PENDING_OPEN, RS2) to 
(FAILED_OPEN, RS2).  It then retries (and may even pick a different RS as the 
target).  Retries are capped at maximumAttempts.  Once maximumAttempts is 
reached, the meta region stays in the 'FAILED_OPEN' state until either (1) RS2 
shuts down, which triggers region assignment again, or (2) it is reassigned by 
an operator via HBase Shell.
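
The state transitions in Step 3 can be sketched as a small simulation.  This 
is not HBase code: the class, the enum, and the openSucceeds callback (standing 
in for the open request sent to the target region server) are all hypothetical 
names chosen for illustration.

```java
import java.util.function.IntPredicate;

final class MetaAssignSketch {
    enum State { OFFLINE, PENDING_OPEN, OPEN, FAILED_OPEN }

    /**
     * Simulates the bounded retry described in Step 3.
     * openSucceeds.test(attempt) returning false models a failed open,
     * e.g. the target RS answering with ServerNotRunningYetException.
     */
    static State assign(int maximumAttempts, IntPredicate openSucceeds) {
        State state = State.OFFLINE;
        for (int attempt = 1; attempt <= maximumAttempts; attempt++) {
            state = State.PENDING_OPEN;      // (OFFLINE, RS1) => (PENDING_OPEN, RS2)
            if (openSucceeds.test(attempt)) {
                return State.OPEN;
            }
            state = State.FAILED_OPEN;       // (PENDING_OPEN, RS2) => (FAILED_OPEN, RS2)
        }
        // Attempts exhausted: nothing in this loop moves the region again.
        return state;
    }
}
```

If the target server only becomes ready after the last attempt, the region is 
left in FAILED_OPEN even though one more try would have succeeded.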

According to the documentation ( http://hbase.apache.org/book/regions.arch.html ), 
this is by design - "17. For regions in FAILED_OPEN or FAILED_CLOSE states, the 
master tries to close them again when they are reassigned by an operator via 
HBase Shell.".  

However, this is a bad design, especially for the meta table region (arguably 
the design is fine for regular tables - in this ticket, I am focusing on 
fixing the meta region availability issue).  



I propose 2 possible fixes:

Fix#1 (band-aid change): in Step 3, just like Step 2, if the region is a meta 
table region, reset the loop count so that the loop never exits with the 
meta table region in FAILED_OPEN state.
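
A rough sketch of what Fix#1 could look like, again as a standalone simulation 
rather than the actual patch.  All names are hypothetical, the loop header is 
assumed to be for (int i = 1; i <= maximumAttempts; i++), and the counter is 
reset to 0 (not 1) so that the reset also works when maximumAttempts is 1.

```java
import java.util.function.IntPredicate;

final class MetaRetryFixSketch {
    enum State { OFFLINE, PENDING_OPEN, OPEN, FAILED_OPEN }

    /**
     * openSucceeds.test(totalTries) stands in for the open request to the
     * target RS. For a meta region the counter is reset when it reaches
     * maximumAttempts, so the loop only ends once an open succeeds.
     */
    static State assign(boolean isMetaRegion, int maximumAttempts,
                        IntPredicate openSucceeds) {
        State state = State.OFFLINE;
        int totalTries = 0;
        for (int i = 1; i <= maximumAttempts; i++) {
            totalTries++;
            state = State.PENDING_OPEN;
            if (openSucceeds.test(totalTries)) {
                return State.OPEN;
            }
            state = State.FAILED_OPEN;
            if (isMetaRegion && i == maximumAttempts) {
                // A real implementation would sleep here before retrying.
                i = 0; // the for-loop's i++ brings it back to 1
            }
        }
        return state;
    }
}
```

Note the trade-off: for a meta region this loop never gives up, so the calling 
thread stays busy until some region server accepts the region.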

Fix#2 (more involved): if a region is in FAILED_OPEN state, we should provide a 
way to automatically trigger AssignmentManager::assign() after a short period 
of time (leaving any region in FAILED_OPEN or other states such as 
'FAILED_CLOSE' is undesirable; there should be some way to retry and auto-heal 
the region).
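
One way Fix#2 might be approached is a delayed re-trigger built on a standard 
ScheduledExecutorService.  This is only a sketch under assumptions: the class 
name, the retryDelayMs parameter, and the tryAssign callback (a stand-in for 
calling AssignmentManager::assign() on the stuck region) are hypothetical and 
do not come from HBase.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

final class FailedOpenHealerSketch {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    private final long retryDelayMs;

    FailedOpenHealerSketch(long retryDelayMs) {
        this.retryDelayMs = retryDelayMs;
    }

    /**
     * Re-runs tryAssign after each delay until it returns true.
     * The returned future completes with the number of tries it took.
     */
    CompletableFuture<Integer> heal(BooleanSupplier tryAssign) {
        CompletableFuture<Integer> done = new CompletableFuture<>();
        schedule(tryAssign, 1, done);
        return done;
    }

    private void schedule(BooleanSupplier tryAssign, int attempt,
                          CompletableFuture<Integer> done) {
        scheduler.schedule(() -> {
            try {
                if (tryAssign.getAsBoolean()) {
                    done.complete(attempt);
                } else {
                    schedule(tryAssign, attempt + 1, done); // still FAILED_OPEN
                }
            } catch (RuntimeException e) {
                done.completeExceptionally(e);
            }
        }, retryDelayMs, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        scheduler.shutdownNow();
    }
}
```

Because the retries run on a separate scheduler thread, the original assign 
call can return after maximumAttempts while the region still gets re-tried in 
the background.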

I think that, at least for 1.0.0, Fix#1 is good enough.  We can open a 
task-type JIRA for Fix#2 in a future release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
