[
https://issues.apache.org/jira/browse/HBASE-28339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813446#comment-17813446
]
Duo Zhang commented on HBASE-28339:
-----------------------------------
We do not have backoff when retrying here?
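For illustration only, a minimal sketch (not existing HBase code; all names and values here are assumptions) of the kind of capped exponential backoff with jitter that could be applied between reconnect attempts, so that a down peer cluster does not translate into a tight retry loop:

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Hypothetical sketch of capped exponential backoff with full jitter.
 * BASE_SLEEP_MS and MAX_SLEEP_MS are illustrative values, not HBase defaults.
 */
public class BackoffSketch {
  static final long BASE_SLEEP_MS = 100;
  static final long MAX_SLEEP_MS = 60_000;

  /** Returns the sleep time (ms) before the given retry attempt (0-based). */
  static long backoffMillis(int attempt) {
    // Exponential growth, with the shift clamped to avoid overflow,
    // then capped so waits stay bounded.
    long exp = BASE_SLEEP_MS << Math.min(attempt, 20);
    long capped = Math.min(exp, MAX_SLEEP_MS);
    // Full jitter in [0, capped] spreads reconnects so many region
    // servers do not hit ZooKeeper (and the KDC) in lockstep.
    return ThreadLocalRandom.current().nextLong(capped + 1);
  }

  public static void main(String[] args) {
    for (int attempt = 0; attempt < 5; attempt++) {
      System.out.println("attempt " + attempt + " -> sleep <= "
        + Math.min(BASE_SLEEP_MS << Math.min(attempt, 20), MAX_SLEEP_MS) + " ms");
    }
  }
}
```

The jitter matters as much as the cap: without it, every endpoint that lost its session at the same moment would retry at the same moment.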
> HBaseReplicationEndpoint creates new ZooKeeper client every time it tries to
> reconnect
> --------------------------------------------------------------------------------------
>
> Key: HBASE-28339
> URL: https://issues.apache.org/jira/browse/HBASE-28339
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 2.6.0, 2.4.17, 3.0.0-beta-1, 2.5.7, 2.7.0
> Reporter: Andor Molnar
> Assignee: Andor Molnar
> Priority: Major
>
> Abstract base class {{HBaseReplicationEndpoint}}, and therefore
> {{HBaseInterClusterReplicationEndpoint}}, creates a new ZooKeeper client
> instance every time an error occurs in communication and it tries to
> reconnect. This was not a problem with ZooKeeper 3.4.x versions, because the
> TGT Login thread was a static reference and only created once for all clients
> in the same JVM. With the upgrade to ZooKeeper 3.5.x the login thread is
> dedicated to the client instance, hence we have a new login thread every time
> the replication endpoint reconnects.
> {code:java}
> /**
>  * A private method used to re-establish a zookeeper session with a peer cluster.
>  */
> protected void reconnect(KeeperException ke) {
>   if (
>     ke instanceof ConnectionLossException || ke instanceof SessionExpiredException
>       || ke instanceof AuthFailedException
>   ) {
>     String clusterKey = ctx.getPeerConfig().getClusterKey();
>     LOG.warn("Lost the ZooKeeper connection for peer " + clusterKey, ke);
>     try {
>       reloadZkWatcher();
>     } catch (IOException io) {
>       LOG.warn("Creation of ZookeeperWatcher failed for peer " + clusterKey, io);
>     }
>   }
> }{code}
> {code:java}
> /**
>  * Closes the current ZKW (if not null) and creates a new one
>  * @throws IOException If anything goes wrong connecting
>  */
> synchronized void reloadZkWatcher() throws IOException {
>   if (zkw != null) zkw.close();
>   zkw = new ZKWatcher(ctx.getConfiguration(),
>     "connection to cluster: " + ctx.getPeerId(), this);
>   getZkw().registerListener(new PeerRegionServerListener(this));
> } {code}
> If the target cluster of replication is unavailable for some reason, the
> replication endpoint keeps trying to reconnect to ZooKeeper, constantly
> destroying and creating new login threads, which will carpet-bomb the KDC
> host with login requests.
>
> I'm not sure how to fix this yet, trying to create a unit test first.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)