[ 
https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897535#comment-15897535
 ] 

stack commented on HBASE-17718:
-------------------------------

Thank you for the review, [~jerryhe]. Let me fix the rb comment and look at 
this test too...

> Difference between RS's servername and its ephemeral node cause SSH stop 
> working
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-17718
>                 URL: https://issues.apache.org/jira/browse/HBASE-17718
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 1.2.4, 1.1.8
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>         Attachments: HBASE-17718.master.001.patch, 
> HBASE-17718.master.002.patch
>
>
> After HBASE-9593, the RS puts up an ephemeral node in ZK before reporting for 
> duty. But if the hosts config (/etc/hosts) differs between master and RS, the 
> RS's serverName can be different from the one stored in the ephemeral ZK 
> node. The email mentioned in HBASE-13753 
> (http://mail-archives.apache.org/mod_mbox/hbase-user/201505.mbox/%3CCANZDn9ueFEEuZMx=pZdmtLsdGLyZz=rrm1N6EQvLswYc1z-H=g...@mail.gmail.com%3E)
>  is exactly what happened in our production env. 
> But what the email didn't point out is that the difference between the 
> serverName in the RS and the one in the ZK node can cause SSH to stop 
> working, as we can see from the code in {{RegionServerTracker}}:
> {code}
>   @Override
>   public void nodeDeleted(String path) {
>     if (path.startsWith(watcher.rsZNode)) {
>       String serverName = ZKUtil.getNodeName(path);
>       LOG.info("RegionServer ephemeral node deleted, processing expiration ["
>         + serverName + "]");
>       ServerName sn = ServerName.parseServerName(serverName);
>       if (!serverManager.isServerOnline(sn)) {
>         LOG.warn(serverName.toString() + " is not online or isn't known to the master."+
>          "The latter could be caused by a DNS misconfiguration.");
>         return;
>       }
>       remove(sn);
>       this.serverManager.expireServer(sn);
>     }
>   }
> {code}
> In that case the server will not be processed by SSH/ServerCrashProcedure. 
> The regions on this server will not be assigned again until the master 
> restarts or fails over.
> I know HBASE-9593 was meant to fix the case where an RS reports for duty and 
> crashes before it can put up a ZK node. That is a very rare case (and 
> controllable: just fix the bug that makes the RS crash). But the issue I 
> mentioned can happen more often (it is uncontrollable and can't be fixed in 
> HBase, since it comes from DNS, hosts config, etc.) and has more severe 
> consequences.
> So here I offer some solutions to discuss:
> 1. Revert HBASE-9593 from all branches; Andrew Purtell has already reverted 
> it in branch-0.98.
> 2. Abort the RS if the master returns a different name, since otherwise SSH 
> can't work properly.
> 3. Have the master accept whatever servername the RS reports and not change 
> it.
> 4. Correct the ZK node if the master returns another name (idea from Ted Yu).
>  
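
To make the failure mode described above concrete, here is a minimal, 
stand-alone sketch. It is not HBase code and not the attached patch: the class 
ServerNameMismatchSketch, its methods, and the example hostnames are all 
hypothetical. It only assumes the usual "host,port,startcode" string form of a 
server name. The first method pair mirrors the shape of the quoted 
nodeDeleted() check; shouldAbort() is one possible reading of option 2.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustration only; none of these names are HBase APIs. */
final class ServerNameMismatchSketch {

    /** Hypothetical stand-in for the master's view of online servers. */
    private final Set<String> onlineServers = new HashSet<>();

    /** The master registers the RS under the name the master itself resolved. */
    public void reportForDuty(String nameSeenByMaster) {
        onlineServers.add(nameSeenByMaster);
    }

    /**
     * Same shape as the quoted nodeDeleted(): if the znode name is unknown to
     * the master, return early and never start crash handling.
     *
     * @return true if expiration (crash handling) would be triggered
     */
    public boolean onEphemeralNodeDeleted(String znodeName) {
        if (!onlineServers.contains(znodeName)) {
            return false; // the "not online or isn't known to the master" branch
        }
        onlineServers.remove(znodeName);
        return true;
    }

    /**
     * Sketch of option 2 (an assumption about how it could look): the RS
     * compares the name it wrote to ZK with the name the master handed back
     * and aborts when they differ.
     */
    public static boolean shouldAbort(String znodeName, String nameFromMaster) {
        return !znodeName.equals(nameFromMaster);
    }

    public static void main(String[] args) {
        // Hypothetical names: the RS resolves itself to a short hostname,
        // while the master resolves the same machine to a fully qualified one.
        String znodeName = "rs1,16020,1488500000000";
        String nameSeenByMaster = "rs1.example.com,16020,1488500000000";

        ServerNameMismatchSketch master = new ServerNameMismatchSketch();
        master.reportForDuty(nameSeenByMaster);

        System.out.println("expiration triggered: "
            + master.onEphemeralNodeDeleted(znodeName));
        System.out.println("RS should abort (option 2): "
            + shouldAbort(znodeName, nameSeenByMaster));
    }
}
```

With the two names above, the deletion of the ephemeral node is ignored 
because the lookup misses, which is the situation the issue describes. 
Option 2 turns that silent miss into an early, visible abort on the RS side.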



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
