[ https://issues.apache.org/jira/browse/HBASE-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244471#comment-13244471 ]
stack commented on HBASE-5666:
------------------------------

On creation of ZooKeeperWatcher, we do the following. Why is it not sufficient?
{code}
// The first call against zk can fail with connection loss. Seems common.
// Apparently this is recoverable. Retry a while.
// See http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling
// TODO: Generalize out in ZKUtil.
long wait = conf.getLong(HConstants.ZOOKEEPER_RECOVERABLE_WAITTIME,
    HConstants.DEFAULT_ZOOKEPER_RECOVERABLE_WAITIME);
long finished = System.currentTimeMillis() + wait;
KeeperException ke = null;
do {
  try {
    ZKUtil.createAndFailSilent(this, baseZNode);
    ke = null;
    break;
  } catch (KeeperException.ConnectionLossException e) {
    if (LOG.isDebugEnabled() && (isFinishedRetryingRecoverable(finished))) {
      LOG.debug("Retrying zk create for another " +
          (finished - System.currentTimeMillis()) +
          "ms; set 'hbase.zookeeper.recoverable.waittime' to change " +
          "wait time); " + e.getMessage());
    }
    ke = e;
  }
} while (isFinishedRetryingRecoverable(finished));
{code}
Is the wait too short?

> RegionServer doesn't retry to check if base node is available
> -------------------------------------------------------------
>
>                 Key: HBASE-5666
>                 URL: https://issues.apache.org/jira/browse/HBASE-5666
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, zookeeper
>            Reporter: Matteo Bertozzi
>            Assignee: Matteo Bertozzi
>         Attachments: HBASE-5666-v1.patch, HBASE-5666-v2.patch, HBASE-5666-v3.patch, HBASE-5666-v4.patch, hbase-1-regionserver.log, hbase-2-regionserver.log, hbase-3-regionserver.log, hbase-master.log, hbase-regionserver.log, hbase-zookeeper.log
>
> I've a script that starts HBase and a couple of region servers in distributed mode (hbase.cluster.distributed = true)
> {code}
> $HBASE_HOME/bin/start-hbase.sh
> $HBASE_HOME/bin/local-regionservers.sh start 1 2 3
> {code}
> but the region servers are not able to start...
> It seems that during the RS start the znode is still not available, and HRegionServer.initializeZooKeeper() checks just once if the base node is available.
> {code}
> 2012-03-28 21:54:05,013 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Check the value configured in 'zookeeper.znode.parent'. There could be a mismatch with the one configured in the master.
> 2012-03-28 21:54:08,598 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server localhost,60202,1332964444824: Initialization of RS failed. Hence aborting RS.
> java.io.IOException: Received the shutdown message while waiting.
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.blockAndCheckIfStopped(HRegionServer.java:626)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:596)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:558)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:672)
>     at java.lang.Thread.run(Thread.java:662)
> {code}
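The ZooKeeperWatcher snippet quoted in the comment is an instance of a deadline-based retry loop: attempt a recoverable operation, and on a transient failure keep retrying until a configured wait expires. The sketch below is a minimal, self-contained illustration of that pattern, not HBase's actual code; the names `RetryUntilDeadline`, `RecoverableOp`, and `DEFAULT_WAIT_MS` are all hypothetical.

```java
// Minimal sketch of the deadline-based retry pattern (illustrative names,
// not HBase's). A recoverable operation is retried until it succeeds or the
// deadline computed from waitMs passes, mirroring the do/while loop above.
public class RetryUntilDeadline {
  static final long DEFAULT_WAIT_MS = 2000; // stand-in for the configured wait

  interface RecoverableOp {
    void run() throws Exception; // throws on a transient (recoverable) failure
  }

  /** Returns true if op succeeded before the deadline, false otherwise. */
  static boolean retry(RecoverableOp op, long waitMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + waitMs;
    do {
      try {
        op.run();
        return true;              // success: stop retrying
      } catch (Exception e) {
        Thread.sleep(50);         // brief back-off between attempts
      }
    } while (System.currentTimeMillis() < deadline);
    return false;                 // deadline passed without a successful attempt
  }

  public static void main(String[] args) throws InterruptedException {
    // Simulate an op that fails twice (like a zk ConnectionLoss) then succeeds.
    final int[] attempts = {0};
    boolean ok = retry(() -> {
      if (++attempts[0] < 3) throw new Exception("connection loss");
    }, DEFAULT_WAIT_MS);
    System.out.println("succeeded=" + ok + " attempts=" + attempts[0]);
  }
}
```

Note that, like the quoted HBase loop, this always makes at least one attempt even if the wait is tiny; the question in the comment ("Is the wait too short?") is about how much extra time that deadline buys after the first failure.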