Thanks a lot, Ryan. Your explanation was very helpful. It's not the first time I've seen someone say that the indexed option is not "as inherently scalable", so I'll remove it and take care of my indexes manually. I also need to fix the swap problem.
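For the record, my reading of the "wrong key class" error is that the master split the hlogs expecting one log key class (THLogKey) while the regionserver had actually written the other (HLogKey), i.e. the master and regionserver disagreed about what type of server they were, exactly as you describe. So "removing it" means dropping the transactional/indexed overrides from hbase-site.xml on every node. If I remember the contrib README correctly, these are the properties involved; the names are from memory, so please verify them against your HBase version before copying:

    <!-- Overrides added for the transactional contrib; I'll be removing
         these. Property names quoted from memory of the contrib docs,
         so double-check them. -->
    <property>
      <name>hbase.regionserver.class</name>
      <value>org.apache.hadoop.hbase.ipc.TransactionalRegionInterface</value>
    </property>
    <property>
      <name>hbase.regionserver.impl</name>
      <value>org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegionServer</value>
    </property>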
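On the zookeeper.session.timeout question below: the five-minute value is just my own experiment, not a recommendation. For anyone curious, this is the kind of override I have in hbase-site.xml; the value is in milliseconds:

    <!-- ZK session timeout: 300000 ms = 5 minutes. My experimental
         value, not a recommended one. -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>300000</value>
    </property>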
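As for the swap problem: my guess is that a swapped-out regionserver JVM stalls long enough for its ZK session to expire, so the fix on my side is to keep the box from swapping at all. A minimal sketch of what I plan to run on each Linux node (standard sysctl, nothing HBase-specific):

    # Tell the kernel to avoid swapping application memory whenever possible
    sysctl -w vm.swappiness=0

    # Make the setting survive reboots
    echo "vm.swappiness = 0" >> /etc/sysctl.conf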
Lucas

On Fri, Oct 16, 2009 at 10:12 PM, Ryan Rawson <[email protected]> wrote:

> Hey,
>
> ZooKeeper is a pretty fundamental part of how we make things happen in
> HBase. The problem is when you lose your session: that is how we
> synchronize between the master and the regionserver. At that point
> neither side knows what the other knows, and the safest thing is to
> abort the regionserver. Without that, we can end up with multiple
> region assignments, which is pretty messy.
>
> ZK is like DNS and the network: without it running, we are more or
> less in trouble. There is no effective difference between a crashed
> machine and one that is having network problems, so they are treated
> the same and recovery is the same.
>
> Having said that, the session timeout is set in HBase, and I think it
> ships at 40 seconds or so, so it should take more than a minor problem
> or a few lost packets to induce a crash. That said, if you are killing
> the entire ZK cluster and expecting HBase to be OK, that is not what
> will happen. This is why ZK is run in a 2N+1 scenario: you can do
> rolling reboots and survive the loss of N machines. But ZK is required
> to be up 24/7; luckily it is fairly reliable.
>
> With HDFS 0.21, at least, we'll be able to have effective hlog recovery.
>
> Now, your specific problem looks like a common issue with the master
> and regionservers being confused about what type of server they are
> running. I don't personally run the indexed or transactional
> extensions (they are not as inherently scalable), so maybe someone
> else can chime in.
>
> -ryan
>
> On Fri, Oct 16, 2009 at 1:29 PM, Lucas Nazário dos Santos
> <[email protected]> wrote:
> > Hi,
> >
> > Today one regionserver crashed and I can't figure out why. Everything
> > started with the message "server,60020,1255644477834 znode expired".
> > I'm still running the cluster on little memory, and swap gets in my
> > way from time to time (it's rare, but I need to fix it). Could that
> > be the cause of the error below? Do you think that five minutes is
> > enough for the property zookeeper.session.timeout? And why the
> > message "wrong key class:
> > org.apache.hadoop.hbase.regionserver.HLogKey is not class"?
> >
> > My tests show that whenever ZooKeeper "shakes", the whole cluster
> > goes down. Shouldn't HBase be more robust regarding ZooKeeper?
> > Something like a retry strategy...
> >
> > Lucas
> >
> > 2009-10-16 15:07:32,167 INFO org.apache.hadoop.hbase.master.ServerManager: 2 region servers, 0 dead, average load 7.0
> > 2009-10-16 15:07:32,537 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 192.168.1.2:60020, regionname: -ROOT-,,0, startKey: <>}
> > 2009-10-16 15:07:32,560 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 192.168.1.2:60020, regionname: -ROOT-,,0, startKey: <>} complete
> > 2009-10-16 15:07:32,654 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 192.168.1.3:60020, regionname: .META.,,1, startKey: <>}
> > 2009-10-16 15:07:32,804 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 12 row(s) of meta region {server: 192.168.1.3:60020, regionname: .META.,,1, startKey: <>} complete
> > 2009-10-16 15:07:32,804 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
> > 2009-10-16 15:08:09,551 INFO org.apache.hadoop.hbase.master.ServerManager: server,60020,1255644477834 znode expired
> > 2009-10-16 15:08:09,605 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
> > 2009-10-16 15:08:09,605 INFO org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of server server,60020,1255644477834: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> > 2009-10-16 15:08:09,623 INFO org.apache.hadoop.hbase.regionserver.HLog: Splitting 20 hlog(s) in hdfs://server2:9000/hbase/.logs/server,60020,1255644477834
> > 2009-10-16 15:08:09,841 WARN org.apache.hadoop.hbase.regionserver.HLog: Exception processing hdfs://server2:9000/hbase/.logs/server,60020,1255644477834/hlog.dat.1255644478353 -- continuing. Possible DATA LOSS!
> > java.io.IOException: wrong key class: org.apache.hadoop.hbase.regionserver.HLogKey is not class org.apache.hadoop.hbase.regionserver.transactional.THLogKey
> >         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1824)
> >         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
> >         at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:896)
> >         at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:802)
> >         at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274)
> >         at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:490)
> >         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:425)
> > 2009-10-16 15:08:09,870 WARN org.apache.hadoop.hbase.regionserver.HLog: Exception processing hdfs://server2:9000/hbase/.logs/server,60020,1255644477834/hlog.dat.1255648058463 -- continuing. Possible DATA LOSS!
> > java.io.IOException: wrong key class: org.apache.hadoop.hbase.regionserver.HLogKey is not class org.apache.hadoop.hbase.regionserver.transactional.THLogKey
> >         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1824)
> >         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
> >         at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:896)
> >         at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:802)
> >         at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274)
> >         at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:490)
> >         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:425)
> > 2009-10-16 15:08:09,886 WARN org.apache.hadoop.hbase.regionserver.HLog: Exception processing hdfs://server2:9000/hbase/.logs/server,60020,12556
> >
> > // More wrong key class errors...
> >
> > 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.regionserver.HLog: hlog file splitting completed in 594 millis for hdfs://server2:9000/hbase/.logs/server,60020,1255644477834
> > 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Log split complete, meta reassignment and scanning:
> > 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionServerOperation: ProcessServerShutdown reassigning ROOT region
> > 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
> > 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionManager: ROOT inserted into regionsInTransition
> > 2009-10-16 15:08:32,167 INFO org.apache.hadoop.hbase.master.ServerManager: 1 region servers, 1 dead, average load 6.0[server,60020,1255644477834]
