[ https://issues.apache.org/jira/browse/HBASE-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662133#comment-13662133 ]
ramkrishna.s.vasudevan commented on HBASE-8545:
-----------------------------------------------

Going through the code once again, it is a bit difficult to know whether the new connection is to the same server but with a different host, or to an entirely new server. If the code below
{code}
if (isDeadServer(serverName)) {
  throw new RegionServerStoppedException(serverName + " is dead.");
}
{code}
works, then things will be fine. I will see if anything can be done here; if not, we can go with the current patch itself.

> Meta stuck in transition when it is assigned to a just restarted dead region server
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-8545
>                 URL: https://issues.apache.org/jira/browse/HBASE-8545
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: trunk-8545.patch, trunk-8545_v2.patch
>
>
> Suppose the meta region server is down, and the SSH tries to re-assign it. This could happen:
> 1. AM plans to assign meta to a region server (R_old);
> 2. Now R_old is dead, and a new region server (R_new) starts up on the same host and port, but gets a different start code;
> 3. AM sends the open region request to R_new and the meta is opened on it;
> 4. AM gets a ZK event, but it is from a different region server instance (R_new), not the expected one (R_old), so it sends a close region request to R_new;
> 5. Now the meta is stuck in transition and won't be assigned.
> This won't happen to a user region, since the SSH for R_old will find the user region stuck in transition and re-assign it. For meta, it is a little different. AM checks whether a dead region server carries the meta based on the ZK info, which is changed to the new region server R_new at step 3 by the open region handler.
> The fix I was thinking about is:
> 1. When checking whether a region server carries a region, use the region transition information if it exists (which is the source of truth, to the master); if not, check the ZK data as before;
> 2. In the open region handler, when transitioning the assign ZK node from offline to opening, make sure the current region server is the expected one (ZK#transitionNode; the existing code doesn't check the target server name).
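
The scenario in the quoted description turns on the fact that R_old and R_new share a host and port and differ only in start code. The following is a minimal, hypothetical sketch of that distinction and of the kind of "is this the expected server?" guard proposed in fix item 2. It is not HBase code: the class, the method names, the host name, the port, and the start code values are all assumptions made up for illustration.

```java
// Hypothetical sketch; these names are illustrative and are not HBase classes.
final class ServerIdentitySketch {

    /** A server instance identified by host, port, and start code. */
    static final class ServerId {
        final String host;
        final int port;
        final long startCode;

        ServerId(String host, int port, long startCode) {
            this.host = host;
            this.port = port;
            this.startCode = startCode;
        }

        /** True when both instances share host and port, ignoring the start code. */
        boolean sameAddress(ServerId other) {
            return host.equals(other.host) && port == other.port;
        }

        /** True only when host, port, and start code all match. */
        boolean sameInstance(ServerId other) {
            return sameAddress(other) && startCode == other.startCode;
        }
    }

    /**
     * Models the proposed guard: allow the offline-to-opening transition only
     * when the server doing the opening is the instance the master planned for.
     */
    static boolean mayTransitionToOpening(ServerId expected, ServerId current) {
        return expected.sameInstance(current);
    }

    public static void main(String[] args) {
        // Assumed values: same host and port, different start codes.
        ServerId rOld = new ServerId("rs1.example.com", 60020, 1000L);
        ServerId rNew = new ServerId("rs1.example.com", 60020, 2000L);

        System.out.println(rOld.sameAddress(rNew));             // prints "true"
        System.out.println(mayTransitionToOpening(rOld, rNew)); // prints "false"
        System.out.println(mayTransitionToOpening(rOld, rOld)); // prints "true"
    }
}
```

With a check of this shape, an open request that was planned for R_old but reaches R_new would be rejected at the offline-to-opening step, rather than being noticed only after the meta has already been opened on the unexpected instance.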