In your first post, you are hitting HBASE-1858. It is fixed in trunk and on the 0.20 branch, but you will need to add the config value to recover from the WAL.
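If it helps, this is the shape of that change in hbase-site.xml. Caveat: I am writing the property name from memory, so treat it as an assumption and check the HBASE-1858 patch for the exact key; the value is the transactional key class that shows up in your log:

  <!-- Property name is my best recollection of the HBASE-1858 fix; verify it
       against the patch. It tells the master's log splitter which WAL key
       class the transactional regionserver wrote. -->
  <property>
    <name>hbase.regionserver.hlog.keyclass</name>
    <value>org.apache.hadoop.hbase.regionserver.transactional.THLogKey</value>
  </property>

Without something like this, the master falls back to the plain HLogKey and you get exactly the "wrong key class" failures in your log.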
I take issue with Ryan's handwavy statement about the index/trx extensions not being scalable. With indexing you pay an extra cost on puts that is essentially a constant times the number of indexes, but this still scales with the number of rows/requests. If you want those indexes, you have to pay that maintenance cost somewhere, and putting the maintenance in the regionserver makes the gets needed to rebuild the indexes a bit cheaper. Trx is a different story; it really depends on your workloads. But if you have lots of small requests that don't often interfere with each other, it should scale.
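To make that cost model concrete, here is a client-side sketch of manual secondary-index maintenance against the plain 0.20 client API. This is not how the indexed contrib does it internally (there it happens inside the regionserver), and the table, column, and value names are made up, but the arithmetic is the same: one write for the base row plus one write per index, a constant factor per put that does not grow with the size of the table or the cluster.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ManualIndexSketch {
  public static void main(String[] args) throws IOException {
    HBaseConfiguration conf = new HBaseConfiguration();

    byte[] row = Bytes.toBytes("row-42");
    byte[] family = Bytes.toBytes("data");

    // One write for the base row.
    HTable base = new HTable(conf, "mytable");
    Put basePut = new Put(row);
    basePut.add(family, Bytes.toBytes("email"), Bytes.toBytes("someone@example.com"));
    basePut.add(family, Bytes.toBytes("city"), Bytes.toBytes("Campinas"));
    base.put(basePut);

    // Plus one write per secondary index: the indexed value becomes the row
    // key of the index table and points back at the base row, so a later
    // lookup by email or city is a single get against the index table.
    String[][] indexes = {
      { "mytable-by-email", "someone@example.com" },
      { "mytable-by-city",  "Campinas" },
    };
    for (String[] ix : indexes) {
      HTable idxTable = new HTable(conf, ix[0]);
      Put idxPut = new Put(Bytes.toBytes(ix[1]));
      idxPut.add(family, Bytes.toBytes("base-row"), row);
      idxTable.put(idxPut);
    }
  }
}

In real code you would cache the HTable instances instead of reopening them per put, but the point stands: the extra write cost is 1 + numberOfIndexes per put, independent of how many rows or regionservers you have.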
On Mon, Oct 19, 2009 at 3:42 AM, Lucas Nazário dos Santos <[email protected]> wrote:

> Thanks a lot Ryan, your explanation was very helpful. It's not the first
> time I have seen someone say that the indexed option is not "as inherently
> scalable". I'll remove it and take care of my indexes manually. Also, I
> need to fix the swap problem.
>
> Lucas
>
>
> On Fri, Oct 16, 2009 at 10:12 PM, Ryan Rawson <[email protected]> wrote:
>
> > Hey,
> >
> > Zookeeper is a pretty fundamental part of how we make things happen in
> > HBase. The problem is that when you lose your session -- which is how we
> > synchronize between the master and the regionserver -- neither side knows
> > what the other knows, and the safest thing is to abort the regionserver.
> > Without that, we can end up with multiple region assignments, which is
> > pretty messy.
> >
> > ZK is like DNS and the network: without it running, we are more or less
> > in trouble. There is no effective difference between a crashed machine
> > and one that is having network problems, so they are treated the same
> > and recovery is the same.
> >
> > Having said that, the session timeout is set in HBase, and I think it
> > ships at 40 seconds or so, so it should take more than a minor problem
> > or a few lost packets to induce a crash. That said, if you are killing
> > the entire ZK cluster and expecting HBase to be OK, that is not really
> > what will happen. This is why ZK is run in a 2N+1 scenario, so you can
> > do rolling reboots and survive the loss of N machines. But ZK is
> > required to be up 24/7; luckily it is fairly reliable.
> >
> > With HDFS 0.21, at least we'll be able to have effective hlog recovery.
> >
> > Now, your specific problem looks like a common issue with the master and
> > regionservers being confused about what type of server they are running.
> > I don't personally run the indexed or transactional extensions (they are
> > not as inherently scalable), so maybe someone else can chime in.
> >
> > -ryan
> >
> > On Fri, Oct 16, 2009 at 1:29 PM, Lucas Nazário dos Santos
> > <[email protected]> wrote:
> > > Hi,
> > >
> > > Today one regionserver crashed and I can't figure out why. Everything
> > > started with the message "server,60020,1255644477834 znode expired".
> > > I'm still running the cluster on little memory, and swap is getting in
> > > my way from time to time (it's rare, but I need to fix it). Can it be
> > > the cause of the error below? Do you think that five minutes is enough
> > > for the property zookeeper.session.timeout? And why the message "wrong
> > > key class: org.apache.hadoop.hbase.regionserver.HLogKey is not class"?
> > >
> > > My tests show that whenever Zookeeper "shakes", the whole cluster goes
> > > down. Shouldn't HBase be more robust regarding Zookeeper? Something
> > > like a retry strategy...
> > >
> > > Lucas
> > >
> > >
> > > 2009-10-16 15:07:32,167 INFO org.apache.hadoop.hbase.master.ServerManager: 2 region servers, 0 dead, average load 7.0
> > > 2009-10-16 15:07:32,537 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 192.168.1.2:60020, regionname: -ROOT-,,0, startKey: <>}
> > > 2009-10-16 15:07:32,560 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 192.168.1.2:60020, regionname: -ROOT-,,0, startKey: <>} complete
> > > 2009-10-16 15:07:32,654 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 192.168.1.3:60020, regionname: .META.,,1, startKey: <>}
> > > 2009-10-16 15:07:32,804 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 12 row(s) of meta region {server: 192.168.1.3:60020, regionname: .META.,,1, startKey: <>} complete
> > > 2009-10-16 15:07:32,804 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
> > > 2009-10-16 15:08:09,551 INFO org.apache.hadoop.hbase.master.ServerManager: server,60020,1255644477834 znode expired
> > > 2009-10-16 15:08:09,605 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
> > > 2009-10-16 15:08:09,605 INFO org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of server server,60020,1255644477834: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> > > 2009-10-16 15:08:09,623 INFO org.apache.hadoop.hbase.regionserver.HLog: Splitting 20 hlog(s) in hdfs://server2:9000/hbase/.logs/server,60020,1255644477834
> > > 2009-10-16 15:08:09,841 WARN org.apache.hadoop.hbase.regionserver.HLog: Exception processing hdfs://server2:9000/hbase/.logs/server,60020,1255644477834/hlog.dat.1255644478353 -- continuing. Possible DATA LOSS!
> > > java.io.IOException: wrong key class: org.apache.hadoop.hbase.regionserver.HLogKey is not class org.apache.hadoop.hbase.regionserver.transactional.THLogKey
> > >         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1824)
> > >         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
> > >         at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:896)
> > >         at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:802)
> > >         at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274)
> > >         at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:490)
> > >         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:425)
> > > 2009-10-16 15:08:09,870 WARN org.apache.hadoop.hbase.regionserver.HLog: Exception processing hdfs://server2:9000/hbase/.logs/server,60020,1255644477834/hlog.dat.1255648058463 -- continuing. Possible DATA LOSS!
> > > java.io.IOException: wrong key class: org.apache.hadoop.hbase.regionserver.HLogKey is not class org.apache.hadoop.hbase.regionserver.transactional.THLogKey
> > >         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1824)
> > >         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
> > >         at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:896)
> > >         at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:802)
> > >         at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274)
> > >         at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:490)
> > >         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:425)
> > > 2009-10-16 15:08:09,886 WARN org.apache.hadoop.hbase.regionserver.HLog: Exception processing hdfs://server2:9000/hbase/.logs/server,60020,12556
> > >
> > > // More wrong key class errors...
> > >
> > > 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.regionserver.HLog: hlog file splitting completed in 594 millis for hdfs://server2:9000/hbase/.logs/server,60020,1255644477834
> > > 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Log split complete, meta reassignment and scanning:
> > > 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionServerOperation: ProcessServerShutdown reassigning ROOT region
> > > 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
> > > 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionManager: ROOT inserted into regionsInTransition
> > > 2009-10-16 15:08:32,167 INFO org.apache.hadoop.hbase.master.ServerManager: 1 region servers, 1 dead, average load 6.0[server,60020,1255644477834]
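One more note on the zookeeper.session.timeout question downthread: it is set on the HBase side, in hbase-site.xml. A minimal sketch; the 40000 here just reflects the roughly 40 seconds Ryan mentions as the shipped value (the unit is milliseconds, as far as I recall), so raise it if your boxes really do swap for longer than that:

  <property>
    <name>zookeeper.session.timeout</name>
    <value>40000</value> <!-- milliseconds; bump this if swapping or GC pauses outlast it -->
  </property>

Keep in mind that a longer timeout also means the master waits that much longer before it notices a genuinely dead regionserver and starts splitting its logs.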
