1. In 0.92 it should recover right away from those errors.
2. It happened to us, it's fine.
I might add that you don't need to stop zookeeper when stopping HBase. Our ZK ensembles have hundreds of days of uptime.

J-D

On Mon, Dec 5, 2011 at 5:10 AM, Mikael Sitruk <[email protected]> wrote:
> Hi
>
> I would like to share with you my findings on the "Unable to read
> additional data from server sessionid 0x0" zk error, which prevented the
> HBase Master from starting.
>
> I have a cluster of 10 RS and a ZK quorum of 3 machines.
> I use a script to start the cluster: hdfs, mapreduce, zk quorum, HBMaster
> and finally HBRS.
>
> Using the script, everything started besides HBase.
>
> While checking the log I found that a zk exception was thrown during
> startup:
> 2011-12-05 00:05:34,622 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
> java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster
>         at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1069)
>         at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:142)
>         at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:102)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
>         at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1083)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:902)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:133)
>         at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:223)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>         at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1064)
>         ... 5 more
>
> Googling on the subject did not provide enough insight into my problem.
>
> I checked zk, and from the shell I got the same kind of exception,
> so I reinstalled zk, checked the command line, and everything was ok.
> I thought it would be the same with HBase, but no! Again I got the
> same behavior (HMaster failed), but this time zk was stable from the
> command line (zkCli).
>
> I continued with several experiments, and then I found the sequence of
> operations that causes the problem!
> If I start the ZK quorum in an order other than the ZK leader (the one
> with myid containing 1) first and then the other zk servers, and then
> immediately start the HBase master, the HBase master fails to load with
> the error above.
> I added a 10 second wait to the script between the ZK start and the HBase
> start, and it resolved the problem.
>
> I suppose the reason for the problem is that when another zk server is
> started prior to the leader, the zk quorum begins some consensus to
> elect a new leader, and this may take several seconds. During this time
> the ZK quorum is not available and HBMaster fails to start.
>
> So I have several questions:
> 1. Is there a way in HBase at startup to check for this situation and
> initiate a 10 second wait before trying to reconnect?
> 2. Let's suppose that HBase is in the middle of some work and a zk failure
> occurs (some nodes fail but n/2+1 zk servers still remain) and the
> election protocol begins. Will HBase be ok, or will it begin a shutdown
> sequence? My understanding is that HBase should be ok as long as there is
> a zk quorum available; it may just need to reconnect, but should not
> shut down or become inaccessible.
>
>
> Regards,
> Mikael.S
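
For versions where the master does not retry on its own, the fixed 10 second wait in the start script could probably be replaced by a check that the quorum is actually serving. The following is only a rough sketch, not something HBase ships: it sends ZooKeeper's `stat` four-letter command to each server and waits until a majority report a `Mode:` line (leader, follower or standalone). The function names, the default port 2181 and the timeouts are assumptions made for illustration, and newer ZooKeeper releases may require the four-letter commands to be whitelisted before `stat` is answered.

```python
import socket
import time


def parse_mode(reply):
    """Return the value of the 'Mode:' line in a 'stat' reply, or None."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    # A server that is still electing answers without a Mode line.
    return None


def zk_mode(host, port=2181, timeout=2.0):
    """Send 'stat' to one ZK server; return its mode, or None if not serving."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"stat")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:  # server closes the connection after replying
                    break
                chunks.append(data)
    except OSError:
        # Connection refused, unreachable host, or timeout: treat as not serving.
        return None
    return parse_mode(b"".join(chunks).decode("utf-8", "replace"))


def wait_for_quorum(hosts, deadline_s=60.0, poll_s=1.0, probe=zk_mode):
    """Poll until a majority (n/2+1) of hosts report a mode, or the deadline passes."""
    needed = len(hosts) // 2 + 1
    end = time.monotonic() + deadline_s
    while True:
        serving = sum(1 for h in hosts if probe(h) is not None)
        if serving >= needed:
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(poll_s)
```

A start script could call something like `wait_for_quorum(["zk1", "zk2", "zk3"])` (hypothetical hostnames) between the ZK start and the HBase master start, and only go on if it returns `True`. Waiting for a majority rather than a single server is a conservative choice; one server reporting a mode may already be enough in practice.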
