anoopsjohn commented on a change in pull request #3293:
URL: https://github.com/apache/hbase/pull/3293#discussion_r637917952



##########
File path: 
hbase-client/src/main/java/org/apache/hadoop/hbase/zookeeper/ReadOnlyZKClient.java
##########
@@ -344,6 +344,10 @@ private void run() {
         } catch (IOException e) {
           task.connectFailed(e);
           continue;
+        } catch (Exception e) {
+          // won't be doing any retries for non IOE.
+          task.closed(e);

Review comment:
       Actually in connect fail case, zk is supposed to throw IOE and we handle 
that above. The connectFailed will cause the retries.  As mentioned in Jira, 
some old versions of zk has a bug and cause a RuntimeException while zk 
hostname resolve.  So as such, on our dev branches this issue wont happen as we 
moved to zk 3.5.7
   Still I think its better to handle here.  Else we will end up hanging the 
Future.get() call ever.  In my cluster version where zk client is old, I see 
this causes ReplicationSource init to hang forver and WALs getting accumulated 
and replication lag.  Later when we see HDFS disk alerts, then only analyzed.   
 the culprit was DNS which gives temp hostname resolution issues.  
   I thought like whether we should catch Exception instead of IOE in 1st place 
and do connectFailed only so that we will get retry.  But just changed my mind 
later because its generic Exception  we catch.  Theoretically its IOE what we 
have to retry right?  Still am open for change.    There is nothing wrong if do 
few more retries before give up.
   BTW in my internal version, i will handle the latter way as I can not 
upgrade zk client version that easily.
   Thoughts Duo?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to