Andor Molnar created HBASE-28339:
------------------------------------
Summary: HBaseReplicationEndpoint creates new ZooKeeper client
every time it tries to reconnect
Key: HBASE-28339
URL: https://issues.apache.org/jira/browse/HBASE-28339
Project: HBase
Issue Type: Bug
Components: Replication
Affects Versions: 2.5.7, 3.0.0-beta-1, 2.4.17, 2.6.0, 2.7.0
Reporter: Andor Molnar
Assignee: Andor Molnar
Abstract base class {{HBaseReplicationEndpoint}}, and therefore {{HBaseInterClusterReplicationEndpoint}}, creates a new ZooKeeper client instance every time a communication error occurs and it tries to reconnect.
This was not a problem with ZooKeeper 3.4.x, because the TGT login
thread was a static reference and only created once for all clients in the same
JVM. With the upgrade to ZooKeeper 3.5.x the login thread is dedicated to the
client instance, so a new login thread is created every time the replication
endpoint reconnects.
{code:java}
/**
 * A private method used to re-establish a zookeeper session with a peer cluster.
 */
protected void reconnect(KeeperException ke) {
  if (
    ke instanceof ConnectionLossException || ke instanceof SessionExpiredException
      || ke instanceof AuthFailedException
  ) {
    String clusterKey = ctx.getPeerConfig().getClusterKey();
    LOG.warn("Lost the ZooKeeper connection for peer " + clusterKey, ke);
    try {
      reloadZkWatcher();
    } catch (IOException io) {
      LOG.warn("Creation of ZookeeperWatcher failed for peer " + clusterKey, io);
    }
  }
}{code}
{code:java}
/**
 * Closes the current ZKW (if not null) and creates a new one
 * @throws IOException If anything goes wrong connecting
 */
synchronized void reloadZkWatcher() throws IOException {
  if (zkw != null) zkw.close();
  zkw = new ZKWatcher(ctx.getConfiguration(),
    "connection to cluster: " + ctx.getPeerId(), this);
  getZkw().registerListener(new PeerRegionServerListener(this));
}{code}
If the target cluster of replication is unavailable for some reason, the
replication endpoint keeps trying to reconnect to ZooKeeper, constantly
destroying and creating login threads, which carpet-bombs the KDC host with
login requests.
I'm not sure how to fix this yet; trying to create a unit test first.
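One possible direction (purely a sketch, not HBase code; the class and names below are hypothetical stand-ins): recreate the client only when the session has actually expired or authentication failed, since a plain {{ConnectionLossException}} can be retried on the existing session without tearing the client down and spawning a fresh login thread. A minimal self-contained illustration of that guard, with a counter standing in for login-thread churn:

```java
// Sketch only: a stand-in client shows how guarding reconnect() limits
// client (and hence login-thread) churn. All names here are hypothetical.
public class ReconnectSketch {

  /** Stand-in for the peer-cluster ZooKeeper client. */
  static class FakeZkClient {
    /** Counts constructions, a proxy for login-thread creation. */
    static int instancesCreated = 0;
    boolean closed = false;

    FakeZkClient() {
      instancesCreated++;
    }

    void close() {
      closed = true;
    }
  }

  static FakeZkClient zkw = new FakeZkClient();

  /**
   * Recreate the client only for unrecoverable session states; on a plain
   * connection loss, keep the existing client and let it retry the session.
   */
  static void reconnect(String errorKind) {
    if (errorKind.equals("SessionExpired") || errorKind.equals("AuthFailed")) {
      zkw.close();
      zkw = new FakeZkClient();
    }
    // "ConnectionLoss": no action, the current client's session is reusable.
  }

  public static void main(String[] args) {
    // Repeated connection losses do not create new clients...
    reconnect("ConnectionLoss");
    reconnect("ConnectionLoss");
    // ...only a genuine session expiry does.
    reconnect("SessionExpired");
    System.out.println("clients created: " + FakeZkClient.instancesCreated);
  }
}
```

The real fix would need to distinguish the {{KeeperException}} subtypes already caught in {{reconnect}}, so this only gestures at the shape of a solution, not at whether HBase's retry semantics allow it.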
--
This message was sent by Atlassian Jira
(v8.20.10#820010)