Thanks a lot, Ryan. Your explanation was very helpful. It's not the first time I've seen someone say that the indexed option is not "as inherently scalable", so I'll remove it and take care of my indexes manually. I also need to fix the swap problem.
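For the record, my reading of the "wrong key class" error is that the master split the hlogs expecting one log key class (THLogKey) while the regionserver had actually written the other (HLogKey), i.e. the master and regionserver disagreed about what type of server they were, exactly as you describe. So "removing it" means dropping the transactional/indexed overrides from hbase-site.xml on every node. If I remember the contrib README correctly, these are the properties involved; the names are from memory, so please verify them against your HBase version before copying:

    <!-- Overrides added for the transactional contrib; I'll be removing
         these. Property names quoted from memory of the contrib docs,
         so double-check them. -->
    <property>
      <name>hbase.regionserver.class</name>
      <value>org.apache.hadoop.hbase.ipc.TransactionalRegionInterface</value>
    </property>
    <property>
      <name>hbase.regionserver.impl</name>
      <value>org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegionServer</value>
    </property>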
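On the zookeeper.session.timeout question below: the five-minute value is just my own experiment, not a recommendation. For anyone curious, this is the kind of override I have in hbase-site.xml; the value is in milliseconds:

    <!-- ZK session timeout: 300000 ms = 5 minutes. My experimental
         value, not a recommended one. -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>300000</value>
    </property>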
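As for the swap problem: my guess is that a swapped-out regionserver JVM stalls long enough for its ZK session to expire, so the fix on my side is to keep the box from swapping at all. A minimal sketch of what I plan to run on each Linux node (standard sysctl, nothing HBase-specific):

    # Tell the kernel to avoid swapping application memory whenever possible
    sysctl -w vm.swappiness=0

    # Make the setting survive reboots
    echo "vm.swappiness = 0" >> /etc/sysctl.conf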
Lucas

On Fri, Oct 16, 2009 at 10:12 PM, Ryan Rawson <[email protected]> wrote:

> Hey,
>
> ZooKeeper is a pretty fundamental part of how we make things happen in
> HBase. The problem is when you lose your session: that is how we
> synchronize between the master and the regionserver. At that point
> neither side knows what the other knows, and the safest thing is to
> abort the regionserver. Without that, we can end up with multiple
> region assignments, which is pretty messy.
>
> ZK is like DNS and the network: without it running, we are more or
> less in trouble. There is no effective difference between a crashed
> machine and one that is having network problems, so they are treated
> the same and recovery is the same.
>
> Having said that, the session timeout is set in HBase, and I think it
> ships at 40 seconds or so, so it should take more than a minor problem
> or a few lost packets to induce a crash. That said, if you are killing
> the entire ZK cluster and expecting HBase to be OK, that is not what
> will happen. This is why ZK is run in a 2N+1 scenario: you can do
> rolling reboots and survive the loss of N machines. But ZK is required
> to be up 24/7; luckily it is fairly reliable.
>
> With HDFS 0.21, at least, we'll be able to have effective hlog recovery.
>
> Now, your specific problem looks like a common issue with the master
> and regionservers being confused about what type of server they are
> running. I don't personally run the indexed or transactional
> extensions (they are not as inherently scalable), so maybe someone
> else can chime in.
>
> -ryan
>
> On Fri, Oct 16, 2009 at 1:29 PM, Lucas Nazário dos Santos
> <[email protected]> wrote:
> > Hi,
> >
> > Today one regionserver crashed and I can't figure out why. Everything
> > started with the message "server,60020,1255644477834 znode expired".
> > I'm still running the cluster on little memory, and swap gets in my
> > way from time to time (it's rare, but I need to fix it). Could that
> > be the cause of the error below? Do you think that five minutes is
> > enough for the property zookeeper.session.timeout? And why the
> > message "wrong key class:
> > org.apache.hadoop.hbase.regionserver.HLogKey is not class"?
> >
> > My tests show that whenever ZooKeeper "shakes", the whole cluster
> > goes down. Shouldn't HBase be more robust regarding ZooKeeper?
> > Something like a retry strategy...
> >
> > Lucas
> >
> > 2009-10-16 15:07:32,167 INFO org.apache.hadoop.hbase.master.ServerManager: 2 region servers, 0 dead, average load 7.0
> > 2009-10-16 15:07:32,537 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 192.168.1.2:60020, regionname: -ROOT-,,0, startKey: <>}
> > 2009-10-16 15:07:32,560 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 192.168.1.2:60020, regionname: -ROOT-,,0, startKey: <>} complete
> > 2009-10-16 15:07:32,654 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 192.168.1.3:60020, regionname: .META.,,1, startKey: <>}
> > 2009-10-16 15:07:32,804 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 12 row(s) of meta region {server: 192.168.1.3:60020, regionname: .META.,,1, startKey: <>} complete
> > 2009-10-16 15:07:32,804 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
> > 2009-10-16 15:08:09,551 INFO org.apache.hadoop.hbase.master.ServerManager: server,60020,1255644477834 znode expired
> > 2009-10-16 15:08:09,605 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
> > 2009-10-16 15:08:09,605 INFO org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of server server,60020,1255644477834: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> > 2009-10-16 15:08:09,623 INFO org.apache.hadoop.hbase.regionserver.HLog: Splitting 20 hlog(s) in hdfs://server2:9000/hbase/.logs/server,60020,1255644477834
> > 2009-10-16 15:08:09,841 WARN org.apache.hadoop.hbase.regionserver.HLog: Exception processing hdfs://server2:9000/hbase/.logs/server,60020,1255644477834/hlog.dat.1255644478353 -- continuing. Possible DATA LOSS!
> > java.io.IOException: wrong key class: org.apache.hadoop.hbase.regionserver.HLogKey is not class org.apache.hadoop.hbase.regionserver.transactional.THLogKey
> >         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1824)
> >         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
> >         at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:896)
> >         at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:802)
> >         at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274)
> >         at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:490)
> >         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:425)
> > 2009-10-16 15:08:09,870 WARN org.apache.hadoop.hbase.regionserver.HLog: Exception processing hdfs://server2:9000/hbase/.logs/server,60020,1255644477834/hlog.dat.1255648058463 -- continuing. Possible DATA LOSS!
> > java.io.IOException: wrong key class: org.apache.hadoop.hbase.regionserver.HLogKey is not class org.apache.hadoop.hbase.regionserver.transactional.THLogKey
> >         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1824)
> >         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
> >         at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:896)
> >         at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:802)
> >         at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274)
> >         at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:490)
> >         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:425)
> > 2009-10-16 15:08:09,886 WARN org.apache.hadoop.hbase.regionserver.HLog: Exception processing hdfs://server2:9000/hbase/.logs/server,60020,12556
> >
> > // More wrong key class errors...
> >
> > 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.regionserver.HLog: hlog file splitting completed in 594 millis for hdfs://server2:9000/hbase/.logs/server,60020,1255644477834
> > 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Log split complete, meta reassignment and scanning:
> > 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionServerOperation: ProcessServerShutdown reassigning ROOT region
> > 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
> > 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionManager: ROOT inserted into regionsInTransition
> > 2009-10-16 15:08:32,167 INFO org.apache.hadoop.hbase.master.ServerManager: 1 region servers, 1 dead, average load 6.0[server,60020,1255644477834]
