[
https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893694#comment-15893694
]
Allan Yang edited comment on HBASE-17718 at 3/3/17 4:08 AM:
------------------------------------------------------------
So you suggest we revert HBASE-9593, [~stack]? Do you need me to upload a patch,
or will you just revert it from your side? If so, please go ahead and help me
resolve this issue. Thank you, sir!
was (Author: allan163):
So you suggest we revert HBASE-9593, [~stack]? If so, please go ahead and help
me resolve this issue. Thank you, sir!
> Difference between RS's servername and its ephemeral node cause SSH stop
> working
> --------------------------------------------------------------------------------
>
> Key: HBASE-17718
> URL: https://issues.apache.org/jira/browse/HBASE-17718
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0, 1.2.4, 1.1.8
> Reporter: Allan Yang
> Assignee: Allan Yang
>
> After HBASE-9593, an RS puts up an ephemeral node in ZK before reporting for
> duty. But if the hosts config (/etc/hosts) differs between the master and the
> RS, the RS's serverName can differ from the one stored in the ephemeral zk
> node. The email mentioned in HBASE-13753
> (http://mail-archives.apache.org/mod_mbox/hbase-user/201505.mbox/%3CCANZDn9ueFEEuZMx=pZdmtLsdGLyZz=rrm1N6EQvLswYc1z-H=g...@mail.gmail.com%3E)
> is exactly what happened in our production env.
> But what the email didn't point out is that the difference between the
> serverName in the RS and the zk node can cause SSH to stop working, as we can
> see from the code in
> {{RegionServerTracker}}
> {code}
> @Override
> public void nodeDeleted(String path) {
>   if (path.startsWith(watcher.rsZNode)) {
>     String serverName = ZKUtil.getNodeName(path);
>     LOG.info("RegionServer ephemeral node deleted, processing expiration [" +
>       serverName + "]");
>     ServerName sn = ServerName.parseServerName(serverName);
>     if (!serverManager.isServerOnline(sn)) {
>       LOG.warn(serverName.toString() + " is not online or isn't known to the master."+
>         "The latter could be caused by a DNS misconfiguration.");
>       return;
>     }
>     remove(sn);
>     this.serverManager.expireServer(sn);
>   }
> }
> {code}
> The server will not be processed by SSH/ServerCrashProcedure. The regions on
> this server will not be assigned again until the master restarts or fails over.
> I know HBASE-9593 was meant to fix the case where an RS reports for duty and
> crashes before it can put up a zk node. That is a very rare case (and
> controllable: just fix the bug that makes the RS crash). But the issue I
> mentioned can happen more often (and is uncontrollable: it can't be fixed in
> HBase, since it comes from DNS, hosts config, etc.) and has more severe
> consequences.
> So here I offer some solutions to discuss:
> 1. Revert HBASE-9593 from all branches. Andrew Purtell has already reverted it
> in branch-0.98.
> 2. Abort the RS if the master returns a different name, since otherwise SSH
> can't work properly.
> 3. Have the master accept whatever servername the RS reports and not change it.
> 4. Correct the zk node if the master returns another name (idea from Ted Yu).
>
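The failure mode described above, and the check that option 2 would add, can be sketched in a few lines of plain Java. This is only an illustrative model, not HBase code: the class ServerNameMismatchSketch, the nested Master class, the shouldAbort helper, and the host names are all hypothetical, and server names are treated as opaque "host,port,startcode" strings.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of the mismatch; none of these types are HBase classes.
final class ServerNameMismatchSketch {

    /** Stand-in for the master's bookkeeping of online region servers. */
    static final class Master {
        final Set<String> onlineServers = new HashSet<>();
        final Set<String> expiredServers = new HashSet<>();

        /**
         * Same shape as the nodeDeleted check quoted in the issue: the znode name
         * is looked up among the online servers, and an unknown name is skipped.
         * Returns true only if the expiration was actually processed.
         */
        boolean nodeDeleted(String znodeServerName) {
            if (!onlineServers.contains(znodeServerName)) {
                return false; // unknown name: no crash handling is triggered
            }
            onlineServers.remove(znodeServerName);
            expiredServers.add(znodeServerName);
            return true;
        }
    }

    /**
     * Sketch of option 2: the RS compares the name stored in its znode with the
     * name the master handed back, and aborts when the two differ.
     */
    static boolean shouldAbort(String znodeServerName, String nameFromMaster) {
        return !znodeServerName.equals(nameFromMaster);
    }

    public static void main(String[] args) {
        // Assumed example: the RS and the master resolve the same host differently.
        String znodeName = "rs1.internal,16020,1488513600000";
        String masterName = "rs1.example.com,16020,1488513600000";

        Master master = new Master();
        master.onlineServers.add(masterName);

        // The RS dies and its ephemeral node disappears under the RS-side name.
        boolean processed = master.nodeDeleted(znodeName);
        System.out.println("expiration processed: " + processed);
        System.out.println("still listed as online: " + master.onlineServers.contains(masterName));

        // With the option 2 check, the RS would have aborted at startup instead.
        System.out.println("RS should abort: " + shouldAbort(znodeName, masterName));
    }
}
```

In this model the master keeps the dead server in its online set because the deleted znode's name never matches, which corresponds to the regions staying unassigned until a master restart or failover.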
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)