anoopsjohn commented on a change in pull request #3293:
URL: https://github.com/apache/hbase/pull/3293#discussion_r637917952
##########
File path:
hbase-client/src/main/java/org/apache/hadoop/hbase/zookeeper/ReadOnlyZKClient.java
##########
@@ -344,6 +344,10 @@ private void run() {
} catch (IOException e) {
task.connectFailed(e);
continue;
+ } catch (Exception e) {
+ // won't be doing any retries for non IOE.
+ task.closed(e);
Review comment:
Actually in connect fail case, zk is supposed to throw IOE and we handle
that above. The connectFailed will cause the retries. As mentioned in Jira,
some old versions of zk has a bug and cause a RuntimeException while zk
hostname resolve. So as such, on our dev branches this issue wont happen as we
moved to zk 3.5.7
Still I think its better to handle here. Else we will end up hanging the
Future.get() call ever. In my cluster version where zk client is old, I see
this causes ReplicationSource init to hang forver and WALs getting accumulated
and replication lag. Later when we see HDFS disk alerts, then only analyzed.
the culprit was DNS which gives temp hostname resolution issues.
I thought like whether we should catch Exception instead of IOE in 1st place
and do connectFailed only so that we will get retry. But just changed my mind
later because its generic Exception we catch. Theoretically its IOE what we
have to retry right? Still am open for change. There is nothing wrong if do
few more retries before give up.
BTW in my internal version, i will handle the latter way as I can not
upgrade zk client version that easily.
Thoughts Duo?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]