[ https://issues.apache.org/jira/browse/HBASE-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244471#comment-13244471 ]

stack commented on HBASE-5666:
------------------------------

On creation of ZooKeeperWatcher, we do the following.  Why is it not sufficient?

{code}
      // The first call against zk can fail with connection loss.  Seems common.
      // Apparently this is recoverable.  Retry a while.
      // See http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling
      // TODO: Generalize out in ZKUtil.
      long wait = conf.getLong(HConstants.ZOOKEEPER_RECOVERABLE_WAITTIME,
          HConstants.DEFAULT_ZOOKEPER_RECOVERABLE_WAITIME);
      long finished = System.currentTimeMillis() + wait;
      KeeperException ke = null;
      do {
        try {
          ZKUtil.createAndFailSilent(this, baseZNode);
          ke = null;
          break;
        } catch (KeeperException.ConnectionLossException e) {
          if (LOG.isDebugEnabled() &&
              isFinishedRetryingRecoverable(finished)) {
            LOG.debug("Retrying zk create for another " +
              (finished - System.currentTimeMillis()) +
              "ms; set 'hbase.zookeeper.recoverable.waittime' to change " +
              "wait time; " + e.getMessage());
          }
          ke = e;
        }
      } while (isFinishedRetryingRecoverable(finished));
{code}

Is the wait too short?
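
For reference, isFinishedRetryingRecoverable() isn't shown above; judging by how the loop uses it, it is presumably just a deadline check along these lines (a sketch reconstructed from the loop, not copied from ZooKeeperWatcher):

{code}
// Assumption: reconstructed from the retry loop above, not the actual
// ZooKeeperWatcher source. Returns true while the deadline has not yet
// passed, i.e. while retrying is still allowed.
private boolean isFinishedRetryingRecoverable(final long finished) {
  return System.currentTimeMillis() < finished;
}
{code}

If that reading is right, the name is inverted relative to what the method returns, which makes the do/while condition easy to misread.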
                
> RegionServer doesn't retry to check if base node is available
> -------------------------------------------------------------
>
>                 Key: HBASE-5666
>                 URL: https://issues.apache.org/jira/browse/HBASE-5666
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, zookeeper
>            Reporter: Matteo Bertozzi
>            Assignee: Matteo Bertozzi
>         Attachments: HBASE-5666-v1.patch, HBASE-5666-v2.patch, 
> HBASE-5666-v3.patch, HBASE-5666-v4.patch, hbase-1-regionserver.log, 
> hbase-2-regionserver.log, hbase-3-regionserver.log, hbase-master.log, 
> hbase-regionserver.log, hbase-zookeeper.log
>
>
> I have a script that starts HBase and a couple of region servers in 
> distributed mode (hbase.cluster.distributed = true):
> {code}
> $HBASE_HOME/bin/start-hbase.sh
> $HBASE_HOME/bin/local-regionservers.sh start 1 2 3
> {code}
> but the region servers are not able to start.
> It seems that at RS startup the base znode is not yet available, and 
> HRegionServer.initializeZooKeeper() checks only once whether it exists 
> (a sketch of a possible retry follows the log below).
> {code}
> 2012-03-28 21:54:05,013 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Check the value 
> configured in 'zookeeper.znode.parent'. There could be a mismatch with the 
> one configured in the master.
> 2012-03-28 21:54:08,598 FATAL 
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server 
> localhost,60202,1332964444824: Initialization of RS failed.  Hence aborting 
> RS.
> java.io.IOException: Received the shutdown message while waiting.
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.blockAndCheckIfStopped(HRegionServer.java:626)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:596)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:558)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:672)
>       at java.lang.Thread.run(Thread.java:662)
> {code}
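> A bounded retry around that check, in the spirit of the ZooKeeperWatcher 
> loop quoted in the comment above, would let the region server outlive a 
> slow-starting master. A minimal sketch, assuming the existing ZKUtil and 
> HConstants names (waitForBaseZNode itself is a hypothetical helper, not 
> the attached patch):
> {code}
> // Sketch only: retry the base znode check instead of failing on the
> // first miss. ZKUtil.checkExists() returns -1 while the znode is absent.
> // waitForBaseZNode is a hypothetical helper, not the actual patch.
> private void waitForBaseZNode(Configuration conf) throws IOException {
>   long deadline = System.currentTimeMillis() +
>       conf.getLong(HConstants.ZOOKEEPER_RECOVERABLE_WAITTIME,
>           HConstants.DEFAULT_ZOOKEPER_RECOVERABLE_WAITIME);
>   try {
>     while (System.currentTimeMillis() < deadline) {
>       if (ZKUtil.checkExists(this.zooKeeper, this.zooKeeper.baseZNode) != -1) {
>         return; // base znode created by the master; continue startup.
>       }
>       Thread.sleep(200); // brief pause before the next check.
>     }
>   } catch (KeeperException e) {
>     throw new IOException("Error checking for base znode", e);
>   } catch (InterruptedException e) {
>     throw new IOException("Interrupted waiting for base znode", e);
>   }
>   throw new IOException("Check the value configured in " +
>       "'zookeeper.znode.parent'. There could be a mismatch with the one " +
>       "configured in the master.");
> }
> {code}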

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira