[ https://issues.apache.org/jira/browse/HBASE-18058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721140#comment-16721140 ]
Hudson commented on HBASE-18058:
--------------------------------

Results for branch branch-1.3
[build #576 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/576/]: (x) *{color:red}-1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/576//General_Nightly_Build_Report/]

(/) {color:green}+1 jdk7 checks{color}
-- For more information [see jdk7 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/576//JDK7_Nightly_Build_Report/]

(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/576//JDK8_Nightly_Build_Report_(Hadoop2)/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> Zookeeper retry sleep time should have an upper limit
> -----------------------------------------------------
>
>                 Key: HBASE-18058
>                 URL: https://issues.apache.org/jira/browse/HBASE-18058
>             Project: HBase
>          Issue Type: Bug
>   Affects Versions: 1.4.0, 2.0.0
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>            Priority: Major
>             Fix For: 1.4.0, 1.3.3, 2.0.0
>
>         Attachments: HBASE-18058-branch-1.patch,
> HBASE-18058-branch-1.v2.patch, HBASE-18058-branch-1.v3.patch,
> HBASE-18058.patch, HBASE-18058.v2.patch
>
>
> Currently, in {{RecoverableZooKeeper}}, the retry backoff sleep time grows
> exponentially but has no upper limit. This leads directly to a very long
> recovery time after ZooKeeper has been down for a while and then comes back.
> An example of the damage done by a high sleep time:
> If the disk on the server hosting ZooKeeper is full, the ZooKeeper quorum
> does not actually go down but rejects all write requests. So on the HBase
> side, new ZK write requests fail with exceptions and are retried. But the
> connection remains, so the session won't time out.
> When the disk-full situation has been resolved, the
> ZooKeeper quorum can work normally again. But the very high sleep time causes
> some modules of the RegionServer/HMaster (for example, the balancer) to keep
> sleeping for a long time before they resume work.
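
The behavior the issue title asks for, exponential backoff with an upper limit, can be illustrated with a minimal Java sketch. This is not the attached patch and not HBase's actual retry code: the class name `CappedBackoff`, the parameters `baseSleepMs` and `maxSleepMs`, and the method `sleepTimeMs` are all hypothetical names chosen for illustration.

```java
/**
 * Hypothetical helper (not part of HBase): computes an exponential
 * backoff sleep time that never exceeds a configured upper limit.
 */
final class CappedBackoff {
    private final long baseSleepMs; // sleep before the first retry (assumed parameter)
    private final long maxSleepMs;  // upper limit on any single sleep (assumed parameter)

    CappedBackoff(long baseSleepMs, long maxSleepMs) {
        if (baseSleepMs <= 0 || maxSleepMs < baseSleepMs) {
            throw new IllegalArgumentException(
                "need 0 < baseSleepMs <= maxSleepMs");
        }
        this.baseSleepMs = baseSleepMs;
        this.maxSleepMs = maxSleepMs;
    }

    /**
     * Returns the sleep time in milliseconds before retry number
     * {@code attempt} (0-based): baseSleepMs * 2^attempt, capped at maxSleepMs.
     */
    long sleepTimeMs(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        long sleep = baseSleepMs;
        // Double step by step and stop at the cap, so the value can
        // never overflow a long no matter how large attempt is.
        for (int i = 0; i < attempt && sleep < maxSleepMs; i++) {
            sleep = (sleep > maxSleepMs / 2) ? maxSleepMs : sleep * 2;
        }
        return sleep;
    }
}
```

With a base of 1000 ms and a limit of 60000 ms, this yields 1000, 2000, 4000, 8000, 16000, 32000, and then 60000 for every later attempt. Under these assumed values, a caller that has been retrying throughout a long outage resumes within one minute of the quorum recovering, rather than after an unbounded sleep.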