[ 
https://issues.apache.org/jira/browse/HBASE-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661762#comment-13661762
 ] 

ramkrishna.s.vasudevan commented on HBASE-8545:
-----------------------------------------------

OK, so I get your point.  Yes, we have to go with the timestamp and compare it 
to see whether it is an older one.
But going through the code, when we do sendRegionOpen() we try to get the 
connection to the server as per the destination server in the plan.
Suppose it is the new server with the latest timestamp.  We try to get the 
connection from rsAdmins:
{code}
 AdminService.BlockingInterface admin = this.rsAdmins.get(sn);
{code}
Here the ServerName equals() method uses compareTo(), which internally uses the 
start code as well.  So ideally we would not get any connection from rsAdmins.
So we will try to form a new connection to the server, but this time 
considering only the hostname and port, without the start code.  So this will 
give a connection to the new RS with the latest start code.  As in the log 
message attached above, the ServerName printed is confusing when you look at 
its start code.
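
To make the lookup behaviour concrete, here is a minimal sketch. It does not use the real HBase classes: SimpleServerName is a hypothetical stand-in for ServerName, the cache holds plain strings instead of AdminService stubs, and the host, port and start codes are made-up values. It only shows that a map keyed by (host, port, start code) misses for a restarted server even though the host and port are unchanged.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class StartCodeLookupSketch {
    /** Hypothetical, simplified stand-in for ServerName: equality includes the start code. */
    static final class SimpleServerName {
        final String host;
        final int port;
        final long startCode;

        SimpleServerName(String host, int port, long startCode) {
            this.host = host;
            this.port = port;
            this.startCode = startCode;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof SimpleServerName)) return false;
            SimpleServerName other = (SimpleServerName) o;
            return host.equals(other.host)
                && port == other.port
                && startCode == other.startCode;
        }

        @Override
        public int hashCode() {
            return Objects.hash(host, port, startCode);
        }

        /** The only part of the name a socket connection actually depends on. */
        String hostAndPort() {
            return host + ":" + port;
        }
    }

    public static void main(String[] args) {
        // Same host and port, different start codes (values are made up).
        SimpleServerName oldInstance = new SimpleServerName("rs1.example.com", 60020, 1000L);
        SimpleServerName newInstance = new SimpleServerName("rs1.example.com", 60020, 2000L);

        // Cache keyed by the full server name, with a string in place of an admin stub.
        Map<SimpleServerName, String> rsAdmins = new HashMap<>();
        rsAdmins.put(oldInstance, "admin stub for old instance");

        // The start code differs, so the cached entry is not found.
        System.out.println(rsAdmins.get(newInstance)); // prints "null"

        // A fresh connection is addressed by host and port only, so it reaches
        // whichever instance is currently listening there.
        System.out.println(oldInstance.hostAndPort().equals(newInstance.hostAndPort())); // prints "true"
    }
}
```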
So if we are not able to do this validation on the RS side, as we are not aware 
of the older timestamp there, we may have to compare the start code of the new 
connection with the one in the plan and throw a ServerNotRunningYetException on 
a mismatch.  What do you say, Jimmy?
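
A rough sketch of the check proposed above, under assumptions: checkStartCode is a hypothetical helper, the exception is a local stand-in with the same simple name as the HBase one, and how the master would learn the start code of the server it actually connected to is left open here.

```java
import java.io.IOException;

class StartCodeCheckSketch {
    /** Local stand-in for illustration only, not the HBase class of the same name. */
    static class ServerNotRunningYetException extends IOException {
        ServerNotRunningYetException(String message) {
            super(message);
        }
    }

    /**
     * Hypothetical helper: reject the open request when the server we reached
     * is a different instance from the one named in the plan.
     */
    static void checkStartCode(long plannedStartCode, long connectedStartCode)
            throws ServerNotRunningYetException {
        if (plannedStartCode != connectedStartCode) {
            throw new ServerNotRunningYetException(
                "Plan expects start code " + plannedStartCode
                    + " but connected server has start code " + connectedStartCode);
        }
    }

    public static void main(String[] args) {
        try {
            checkStartCode(1000L, 1000L); // same instance: passes silently
            checkStartCode(1000L, 2000L); // restarted instance: throws
        } catch (ServerNotRunningYetException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Whether to fail the request this way or to retry against the new instance is a design decision the sketch does not settle.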


                
> Meta stuck in transition when it is assigned to a just restarted dead region 
> server 
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-8545
>                 URL: https://issues.apache.org/jira/browse/HBASE-8545
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: trunk-8545.patch, trunk-8545_v2.patch
>
>
> Suppose the meta region server is down, and the SSH tries to re-assign it.  
> This could happen:
> 1. AM plans to assign meta to a region server (R_old);
> 2. Now R_old is dead, the new region server (R_new) starts up on the same 
> host, port, but gets a different start code;
> 3. AM sends the open region request to R_new and the Meta is opened on it;
> 4. AM gets ZK event, but it is from a different region server instance 
> (R_new), not the expected one (R_old), so it sends a close region request to 
> R_new;
> 5. Now, the meta is stuck in transition and won't be assigned.
> This won't happen to a user region since the SSH for R_old will find out the 
> user region stuck in transition and re-assign it.  For meta, it is a little 
> different.  AM checks if a dead region server carries the meta based on the 
> ZK info, which is changed to the new region server R_new at step 3 by the 
> open region handler.
> The fix I was thinking about is:
> 1. When checking if a region server carries a region, use the region 
> transition information if it exists (which is the source of truth, to 
> master); if not, check the ZK data as before;
> 2. In the open region handler, when transitioning the assign ZK node from 
> offline to opening, make sure the current region server is the expected one 
> (ZK#transitionNode; existing code doesn't check the target server name).
