Jimmy Xiang created HBASE-8545:
----------------------------------

             Summary: Meta stuck in transition when it is assigned to a just 
restarted dead region server 
                 Key: HBASE-8545
                 URL: https://issues.apache.org/jira/browse/HBASE-8545
             Project: HBase
          Issue Type: Bug
          Components: Region Assignment
            Reporter: Jimmy Xiang
            Assignee: Jimmy Xiang


Suppose the region server hosting meta goes down, and the SSH 
(ServerShutdownHandler) tries to re-assign meta.  This could happen:

1. AM (AssignmentManager) plans to assign meta to a region server (R_old);
2. R_old then dies, and a new region server (R_new) starts up on the same host 
and port, but gets a different start code;
3. AM sends the open region request to R_new, and meta is opened on it;
4. AM gets the ZK event, but it comes from a different region server instance 
(R_new), not the expected one (R_old), so AM sends a close region request to 
R_new;
5. Now meta is stuck in transition and won't be assigned.
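
The mismatch in step 4 can be illustrated with a minimal sketch.  ServerId and 
onOpenedEvent are hypothetical stand-ins, not HBase classes; the sketch only 
assumes that a server instance is identified by host, port and start code, so 
two instances on the same host and port are still different servers:

```java
import java.util.Objects;

public class StartCodeMismatch {
    /** Hypothetical server identity: host, port and start code. */
    static final class ServerId {
        final String host;
        final int port;
        final long startCode;

        ServerId(String host, int port, long startCode) {
            this.host = host;
            this.port = port;
            this.startCode = startCode;
        }

        /** True when both instances use the same host and port. */
        boolean sameAddress(ServerId other) {
            return host.equals(other.host) && port == other.port;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof ServerId)) {
                return false;
            }
            ServerId other = (ServerId) o;
            return sameAddress(other) && startCode == other.startCode;
        }

        @Override
        public int hashCode() {
            return Objects.hash(host, port, startCode);
        }
    }

    /** Hypothetical reaction of the master to an "opened" event from ZK. */
    static String onOpenedEvent(ServerId expected, ServerId reporter) {
        // The event is accepted only if it comes from the exact instance
        // the plan named; otherwise the region is asked to close.
        return expected.equals(reporter) ? "ACCEPT" : "CLOSE_REGION";
    }

    public static void main(String[] args) {
        ServerId rOld = new ServerId("host1", 60020, 1000L);
        ServerId rNew = new ServerId("host1", 60020, 2000L);
        System.out.println("same address: " + rOld.sameAddress(rNew));
        System.out.println("decision: " + onOpenedEvent(rOld, rNew));
    }
}
```

The request in step 3 reaches R_new because it is addressed by host and port, 
while the check in step 4 also compares the start code.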

This won't happen to a user region, since the SSH for R_old will find the user 
region stuck in transition and re-assign it.  Meta is a little different: AM 
checks whether a dead region server carries meta based on the ZK info, which 
the open region handler has already changed to the new region server R_new at 
step 3.

The fix I was thinking about is:
1. When checking whether a region server carries a region, use the region 
transition information if it exists (which is the source of truth to the 
master); if not, check the ZK data as before;
2. In the open region handler, when transitioning the assignment ZK node from 
offline to opening, make sure the current region server is the expected one 
(ZK#transitionNode; the existing code doesn't check the target server name).
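
A rough sketch of the two checks, under stated assumptions: MetaCarrierCheck 
and all of its methods are hypothetical, the two maps stand in for the 
master's in-memory region transition state and for the data kept in ZK, and 
servers are plain "host,port,startcode" strings.  It is not the actual patch:

```java
import java.util.HashMap;
import java.util.Map;

public class MetaCarrierCheck {
    // Hypothetical master-side state: region name -> server the plan targets.
    private final Map<String, String> regionsInTransition = new HashMap<>();
    // Hypothetical stand-in for the ZK data: region name -> recorded server.
    private final Map<String, String> zkLocations = new HashMap<>();

    void planAssignment(String region, String server) {
        regionsInTransition.put(region, server);
    }

    void recordZkLocation(String region, String server) {
        zkLocations.put(region, server);
    }

    /** Item 1: prefer the transition information, fall back to the ZK data. */
    boolean isCarryingRegion(String server, String region) {
        String planned = regionsInTransition.get(region);
        if (planned != null) {
            return planned.equals(server);
        }
        return server.equals(zkLocations.get(region));
    }

    /** Item 2: refuse offline -> opening unless this server is the target. */
    boolean transitionOfflineToOpening(String region, String expectedServer,
                                       String currentServer) {
        if (!expectedServer.equals(currentServer)) {
            return false; // a different instance (e.g. R_new) must not proceed
        }
        recordZkLocation(region, currentServer);
        return true;
    }

    public static void main(String[] args) {
        MetaCarrierCheck check = new MetaCarrierCheck();
        String rOld = "host1,60020,1000";
        String rNew = "host1,60020,2000";
        check.planAssignment("meta", rOld);
        System.out.println(check.transitionOfflineToOpening("meta", rOld, rNew));
        System.out.println(check.isCarryingRegion(rOld, "meta"));
    }
}
```

With item 1 alone, R_old is still reported as the meta carrier even after the 
ZK data has been overwritten with R_new, so the SSH for R_old can re-assign 
meta; item 2 keeps R_new from overwriting the ZK data in the first place.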

