[
https://issues.apache.org/jira/browse/HBASE-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194445#comment-13194445
]
Jieshan Bean commented on HBASE-5153:
-------------------------------------
"The endless loop happens when ZK is actually down."
If ZK is actually down, the code below will throw an exception:
this.zooKeeper = getZooKeeperWatcher();
It is then caught by the code below:
{noformat}
try {
  LOG.info("This client just lost it's session with ZooKeeper, trying" +
      " to reconnect.");
  resetZooKeeperTrackersWithRetries();
  LOG.info("Reconnected successfully. This disconnect could have been" +
      " caused by a network partition or a long-running GC pause," +
      " either way it's recommended that you verify your environment.");
  return;
} catch (ZooKeeperConnectionException e) {
  LOG.error("Could not reconnect to ZooKeeper after session" +
      " expiration, aborting");
  t = e;
}
if (t != null) LOG.fatal(msg, t);
else LOG.fatal(msg);
HConnectionManager.deleteStaleConnection(this);
{noformat}
So it should not be an endless loop. Does that make sense?
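To illustrate why a retry with a fixed attempt count terminates, here is a minimal sketch. It is not the code from the patch: BoundedRetrySketch, callWithRetries, maxAttempts and pauseMillis are hypothetical names, and the Callable stands in for the reconnect step. Once the last attempt fails, the exception is rethrown to the caller, which can then abort as in the snippet above.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical illustration only; not the HBase implementation.
final class BoundedRetrySketch {

    /**
     * Runs the action up to maxAttempts times, pausing between attempts,
     * and rethrows the last failure once the attempts are used up.
     */
    static <T> T callWithRetries(Callable<T> action, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis); // wait before the next attempt
                }
            }
        }
        // The loop is bounded, so a persistent failure always ends up here.
        throw last;
    }

    public static void main(String[] args) {
        try {
            // Simulates a reconnect that fails every time (ZK actually down).
            callWithRetries(() -> {
                throw new IOException("ZooKeeper unreachable");
            }, 3, 10L);
        } catch (Exception e) {
            System.out.println("Gave up: " + e.getMessage());
        }
    }
}
```

The caller sees exactly one exception after the final attempt, so the abort path runs once instead of looping.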
> Add retry logic in HConnectionImplementation#resetZooKeeperTrackers
> -------------------------------------------------------------------
>
> Key: HBASE-5153
> URL: https://issues.apache.org/jira/browse/HBASE-5153
> Project: HBase
> Issue Type: Bug
> Components: client
> Affects Versions: 0.90.4
> Reporter: Jieshan Bean
> Assignee: Jieshan Bean
> Fix For: 0.94.0, 0.90.6, 0.92.1
>
> Attachments: 5153-92.txt, 5153-trunk-v2.txt, 5153-trunk.txt,
> 5153-trunk.txt, HBASE-5153-V2.patch, HBASE-5153-V3.patch,
> HBASE-5153-V4-90.patch, HBASE-5153-V5-90.patch,
> HBASE-5153-V6-90-minorchange.patch, HBASE-5153-V6-90.txt,
> HBASE-5153-trunk-v2.patch, HBASE-5153-trunk.patch, HBASE-5153.patch,
> TestResults-hbase5153.out
>
>
> HBASE-4893 is related to this issue. From that issue we know that if multiple
> threads share the same connection, once the connection is aborted in one
> thread, the other threads will get a
> "HConnectionManager$HConnectionImplementation@18fb1f7 closed" exception.
> That issue solves the problem of stale connections that can't be removed, but
> the original HTable instance can no longer be used. The connection in HTable
> should be recreated.
> Actually, there are two approaches to solve this:
> 1. In user code, once an IOE is caught, close the connection and re-create the
> HTable instance. We can use this as a workaround.
> 2. On the HBase client side, catch this exception and re-create the connection.
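
The first approach (the user-code workaround) could look roughly like the sketch below. It deliberately avoids the HBase API: TableHandle, RecreatingTableClient and the factory are hypothetical stand-ins for HTable and for whatever code creates a new HTable on a fresh connection. It retries once after re-creating the handle, which is an assumption and not something the issue prescribes.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.function.Supplier;

// Hypothetical stand-in for a table handle such as HTable; not an HBase type.
interface TableHandle extends Closeable {
    String get(String row) throws IOException;
}

// Workaround sketch: on an IOException, close the handle and build a fresh one.
final class RecreatingTableClient {
    private final Supplier<TableHandle> factory; // creates a handle on a new connection
    private TableHandle table;

    RecreatingTableClient(Supplier<TableHandle> factory) {
        this.factory = factory;
        this.table = factory.get();
    }

    String get(String row) throws IOException {
        try {
            return table.get(row);
        } catch (IOException e) {
            // The shared connection may have been aborted by another thread.
            try {
                table.close();
            } catch (IOException ignored) {
                // Best effort: the old handle is discarded either way.
            }
            table = factory.get();
            return table.get(row); // retried once; a second failure propagates
        }
    }
}
```

This class is not thread-safe; callers sharing one instance across threads would need their own synchronization.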