[
https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Allan Yang updated HBASE-17718:
-------------------------------
Description:
After HBASE-9593, an RS puts up an ephemeral node in ZK before reporting for duty.
But if the hosts config (/etc/hosts) differs between the master and the RS, the RS's
serverName can differ from the one stored in the ephemeral ZK node. The
email mentioned in HBASE-13753
(http://mail-archives.apache.org/mod_mbox/hbase-user/201505.mbox/%3CCANZDn9ueFEEuZMx=pZdmtLsdGLyZz=rrm1N6EQvLswYc1z-H=g...@mail.gmail.com%3E)
is exactly what happened in our production env.
But what the email didn't point out is that the difference between the serverName
in the RS and the one in the ZK node can cause SSH to stop working, as we can see
from the code in {{RegionServerTracker}}:
{code}
  @Override
  public void nodeDeleted(String path) {
    if (path.startsWith(watcher.rsZNode)) {
      String serverName = ZKUtil.getNodeName(path);
      LOG.info("RegionServer ephemeral node deleted, processing expiration [" +
        serverName + "]");
      ServerName sn = ServerName.parseServerName(serverName);
      if (!serverManager.isServerOnline(sn)) {
        LOG.warn(serverName.toString() + " is not online or isn't known to the master."+
          "The latter could be caused by a DNS misconfiguration.");
        return;
      }
      remove(sn);
      this.serverManager.expireServer(sn);
    }
  }
{code}
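Because the name parsed from the znode is not in the master's online server list,
the early return is taken and the server is never processed by
SSH/ServerCrashProcedure. The regions on this server will not be assigned again
until the master restarts or fails over.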
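
To make the failure mode concrete, here is a minimal standalone sketch of the mismatch. It does not use any HBase classes: {{ServerNameMismatchDemo}} and its methods are hypothetical stand-ins for the master's online-server list and the tracker callback, and the hostnames are made up. It only assumes the usual "host,port,startcode" server name format.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model, not HBase code: shows why a znode name that differs
// from the name the master registered is silently ignored on deletion.
public class ServerNameMismatchDemo {
    // Names the master considers online, as "host,port,startcode".
    private final Set<String> onlineServers = new HashSet<>();
    // Names that were handed to crash processing.
    private final Set<String> expiredServers = new HashSet<>();

    // The master registers the RS under the hostname the master resolved.
    void reportForDuty(String nameSeenByMaster) {
        onlineServers.add(nameSeenByMaster);
    }

    // Mirrors the shape of nodeDeleted: unknown names take the early return.
    boolean nodeDeleted(String znodeName) {
        if (!onlineServers.contains(znodeName)) {
            return false; // the "not online or isn't known to the master" branch
        }
        onlineServers.remove(znodeName);
        expiredServers.add(znodeName);
        return true;
    }

    boolean isOnline(String name) {
        return onlineServers.contains(name);
    }

    boolean wasExpired(String name) {
        return expiredServers.contains(name);
    }

    public static void main(String[] args) {
        ServerNameMismatchDemo master = new ServerNameMismatchDemo();
        // The RS created its znode under a short hostname, but the master
        // resolved the same machine to a fully qualified name.
        String znodeName = "rs1,16020,1488000000000";
        String masterName = "rs1.example.com,16020,1488000000000";
        master.reportForDuty(masterName);

        boolean processed = master.nodeDeleted(znodeName);
        System.out.println("expiration processed: " + processed);
        System.out.println("master still thinks RS is online: " + master.isOnline(masterName));
    }
}
```

In this model the dead server stays in the online set forever, which corresponds to regions staying unassigned until the master restarts or fails over.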
I know HBASE-9593 was meant to fix the case where an RS reports for duty and
crashes before it can put up a ZK node. That is a very rare case (and controllable:
just fix the bug that makes the RS crash). But the issue I mentioned can happen
more often (and is uncontrollable: it can't be fixed in HBase, since it is caused by
DNS, hosts config, etc.) and has more severe consequences.
So here I offer some solutions to discuss:
1. Revert HBASE-9593 from all branches; Andrew Purtell has already reverted it in
branch-0.98
2. Abort the RS if the master returns a different name, otherwise SSH can't work
properly
3. The master accepts whatever servername the RS reports and doesn't change it.
4. Correct the ZK node if the master returns another name (idea from Ted Yu)
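
As a rough illustration of option 2, the check could look something like the sketch below. This is not taken from any patch: {{HostnameCheckSketch}}, {{sameHost}} and {{checkOrAbort}} are hypothetical names, and the real abort path in the RS would go through its own abort mechanism rather than an exception.

```java
// Hypothetical sketch of option 2: compare the hostname used for the
// ephemeral znode with the hostname the master sent back, and refuse to
// continue when they differ.
public class HostnameCheckSketch {

    // Hostnames are compared case-insensitively; a null on either side
    // is treated as a mismatch.
    static boolean sameHost(String znodeHost, String hostSeenByMaster) {
        return znodeHost != null
            && hostSeenByMaster != null
            && znodeHost.equalsIgnoreCase(hostSeenByMaster);
    }

    // Stand-in for aborting the RS: throws if the two names disagree.
    static void checkOrAbort(String znodeHost, String hostSeenByMaster) {
        if (!sameHost(znodeHost, hostSeenByMaster)) {
            throw new IllegalStateException(
                "Master sees this server as '" + hostSeenByMaster
                + "' but the ephemeral znode was created for '" + znodeHost
                + "'; aborting so the master never tracks an unreachable name");
        }
    }

    public static void main(String[] args) {
        checkOrAbort("rs1.example.com", "RS1.example.com"); // same host, passes
        try {
            checkOrAbort("rs1", "rs1.example.com"); // differs, aborts
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing fast this way trades availability of the one misconfigured RS for the guarantee that every server the master tracks can later be expired through its znode.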
> Difference between RS's servername and its ephemeral node cause SSH stop
> working
> --------------------------------------------------------------------------------
>
> Key: HBASE-17718
> URL: https://issues.apache.org/jira/browse/HBASE-17718
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0, 1.2.4, 1.1.8
> Reporter: Allan Yang
> Assignee: Allan Yang
>
>
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)