[ 
https://issues.apache.org/jira/browse/HBASE-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13250269#comment-13250269
 ] 

stack commented on HBASE-5666:
------------------------------

Patch looks good.

The patch logs:

+            LOG.warn(zkw.prefix("Unable to set watcher on znode (" + znode + ")"), keeperEx);

... but the method says it's checkExists w/o setting a watch.

I think this is a bad idea; i.e. sleeping w/o interrupt.  How long is 
SOCKET_RETRY_WAIT_MS?  What if we try to stop the hosting server in the 
meantime?  Do we have to wait for it to come out of this loop?

+        Threads.sleepWithoutInterrupt(HConstants.SOCKET_RETRY_WAIT_MS);

Passing 0, are we supposed to try once only?  My guess is that we could try 
more than once given how the loop runs; i.e. we may loop multiple times in the 
same millisecond.  You might want to exit the loop if the timeout is zero.
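
Something along these lines might cover both points above (interruptible 
sleep and the zero-timeout case).  This is a rough sketch only, not the 
patch's code: BaseNodeWait, waitFor and the two suppliers are made-up names 
standing in for the real znode check and the server's stopped flag.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper; not part of HBase. */
public final class BaseNodeWait {
    private BaseNodeWait() {}

    /**
     * Polls {@code check} until it returns true, the timeout elapses,
     * {@code stopped} reports true, or the thread is interrupted.
     * A timeout of zero means exactly one attempt.
     */
    public static boolean waitFor(BooleanSupplier check, BooleanSupplier stopped,
                                  long timeoutMs, long retryWaitMs) {
        final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (check.getAsBoolean()) {
                return true;
            }
            // Zero timeout: one attempt only, never loop.
            if (timeoutMs == 0) {
                return false;
            }
            // Give up promptly if the hosting server is being stopped.
            if (stopped.getAsBoolean()) {
                return false;
            }
            long remainingNs = deadline - System.nanoTime();
            if (remainingNs <= 0) {
                return false;
            }
            // Never sleep past the deadline.
            long sleepMs = Math.min(retryWaitMs, Math.max(1L, remainingNs / 1_000_000L));
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and bail out instead of swallowing it.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The caller would pass the existing znode existence check as the first 
argument and something like the server's isStopped() as the second, so a stop 
request does not have to wait out the whole timeout.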

What happens if a client comes in during this time?  Will it crash out 
immediately because there is no base node?

Thanks Matteo.
                
> RegionServer doesn't retry to check if base node is available
> -------------------------------------------------------------
>
>                 Key: HBASE-5666
>                 URL: https://issues.apache.org/jira/browse/HBASE-5666
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, zookeeper
>    Affects Versions: 0.92.1, 0.94.0, 0.96.0
>            Reporter: Matteo Bertozzi
>            Assignee: Matteo Bertozzi
>         Attachments: HBASE-5666-v1.patch, HBASE-5666-v2.patch, 
> HBASE-5666-v3.patch, HBASE-5666-v4.patch, HBASE-5666-v5.patch, 
> HBASE-5666-v6.patch, hbase-1-regionserver.log, hbase-2-regionserver.log, 
> hbase-3-regionserver.log, hbase-master.log, hbase-regionserver.log, 
> hbase-zookeeper.log
>
>
> I've a script that starts hbase and a couple of region servers in distributed 
> mode (hbase.cluster.distributed = true)
> {code}
> $HBASE_HOME/bin/start-hbase.sh
> $HBASE_HOME/bin/local-regionservers.sh start 1 2 3
> {code}
> but the region servers are not able to start...
> It seems that during the RS start the znode is still not available, and 
> HRegionServer.initializeZooKeeper() checks just once if the base node is 
> available.
> {code}
> 2012-03-28 21:54:05,013 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Check the value configured in 'zookeeper.znode.parent'. There could be a mismatch with the one configured in the master.
> 2012-03-28 21:54:08,598 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server localhost,60202,1332964444824: Initialization of RS failed.  Hence aborting RS.
> java.io.IOException: Received the shutdown message while waiting.
>       at org.apache.hadoop.hbase.regionserver.HRegionServer.blockAndCheckIfStopped(HRegionServer.java:626)
>       at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:596)
>       at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:558)
>       at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:672)
>       at java.lang.Thread.run(Thread.java:662)
> {code}
