Can you post more of the region server logs from before the crash? You can use pastebin.com if you'd like.
-ryan

On Sun, Jun 28, 2009 at 10:12 PM, Murali Krishna. P<[email protected]> wrote:
> Hi Andrew,
> Thanks for looking into this.
> I tried adding 3 nodes to the zoo.cfg but it threw an error saying the 'myid'
> file is missing. Now even if I go back to my old config, it still throws the
> error :(
>
> Thanks,
> Murali Krishna
>
> ________________________________
> From: Andrew Purtell <[email protected]>
> To: [email protected]
> Sent: Sunday, 28 June, 2009 10:47:12 PM
> Subject: Re: Region servers going down frequently (0.20 alpha)
>
> Hello,
>
> As a first step, deploy Zookeeper quorum peers on all of your nodes and
> list all peers in the zoo.cfg files of your Zookeeper install and HBase:
>
> server.1=node1:2888:3888
> server.2=node2:2888:3888
> server.3=node3:2888:3888
>
> Are you running mapreduce tasks in addition to what you have described
> below?
>
> Do you see any messages in the master or region server logs along the lines
> of "we slept for NNNNNNms, wanted NNNNms"? How much RAM do these nodes have?
> Do you have host level metrics running? If not, consider watching this with
> Ganglia, or, in this case, since the cluster is so small, three terminals
> running top or atop. After 20, 30 minutes, is all available RAM full and are
> the nodes going into swap?
>
> - Andy
>
> ________________________________
> From: Murali Krishna. P <[email protected]>
> To: [email protected]
> Sent: Sunday, June 28, 2009 8:23:27 AM
> Subject: Region servers going down frequently (0.20 alpha)
>
> Hi,
> I am repeatedly running into this issue where all the region servers try
> to restart but fail to come up. All the region servers seem to have the
> same kind of exception, which causes this state.
>
> My cluster is as follows:
> node1: Master, NN, DN, RS, TT, XX
> node2: Zookeeper, JT, DN, RS, TT, XX
> node3: DN, RS, TT, XX
>
> where XX is my own hbase client with around 150 threads writing to a common
> table.
>
> The setup works fine for some time and then goes down (after 20, 30 mins).
> Here is the sequence in the region server logs:
>
> * RS gets a zookeeper event: Got ZooKeeper event, state: Disconnected,
> type: None, path: null
> * RS retries 'processing message', gets LeaseStillHeldException:
> 2009-06-28 02:14:17,013 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Processing message (Retry: 1)
> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
> * After 10 retries, gets another zookeeper event: Got ZooKeeper event,
> state: Expired, type: None, path: null
> 2009-06-28 02:14:17,751 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
> 2009-06-28 02:14:17,751 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Restarting Region Server
> * Decides to restart region server, but logs lots of errors like this:
> 2009-06-28 02:14:17,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 280 on 60020, call exists([...@75880048, row=724b330295375ad0ba68fa85325381, maxVersions=1, timeRange=[0,9223372036854775807), families=ALL) from 69.147.127.248:48945: error: java.io.IOException: Server not running, aborting
> * Above might be happening because client 'XX' is still trying to write?
> Finally it closes the region server and tries to restart, but gets the
> following exception:
> 2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread.
> 2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Runs every 10000000ms
> 2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete
> 2009-06-28 02:14:27,032 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Failed init
> java.lang.NullPointerException
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:431)
>     at java.lang.Thread.run(Thread.java:619)
> 2009-06-28 02:14:27,110 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Unhandled exception. Aborting...
> java.io.IOException: Region server startup failed
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:832)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:751)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:431)
>     at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.NullPointerException
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
>     ... 2 more
> 2009-06-28 02:14:27,122 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=9, stores=10, storefiles=20, storefileIndexSize=0, memcacheSize=52, usedHeap=170, maxHeap=1995, blockCacheSize=49971560, blockCacheFree=28440, blockCacheCount=765, blockCacheHitRatio=94
> 2009-06-28 02:14:27,131 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> 2009-06-28 02:14:27,131 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
> 2009-06-28 02:14:27,131 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: On abort, closed hlog
> 2009-06-28 02:14:27,136 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: 0.0.0.0:60020
>
> The region server dies after that. All 3 region servers die like this
> and I have to start them manually. But after 10-15 minutes, the cluster
> runs into the same state again. Please help me find the root
> cause of this.
>
> Thanks,
> Murali Krishna
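
For pulling together the log excerpts asked for in this thread, here is a minimal Python sketch (not part of HBase) that picks out the ZooKeeper state events, any pause warnings, and ERROR/FATAL lines from a region server log. The names summarize, ZK_EVENT and STAMPED are made up for illustration. The two patterns are modelled only on the log lines quoted in this thread, and matching on the word "slept" is an assumption, since the exact wording of that warning is not given here.

```python
import re

# Modelled on the event lines quoted in this thread, e.g.
# "Got ZooKeeper event, state: Expired, type: None, path: null"
ZK_EVENT = re.compile(r"Got ZooKeeper event, state: (\w+), type: (\w+), path: (\S+)")

# "YYYY-MM-DD HH:MM:SS,mmm LEVEL " prefix as it appears in the quoted log lines.
STAMPED = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+) ")


def summarize(lines):
    """Return (timestamp, kind, text) tuples for lines worth posting.

    timestamp is None when the line has no recognizable prefix.
    """
    hits = []
    for line in lines:
        text = line.strip()
        stamp = STAMPED.match(text)
        when = stamp.group(1) if stamp else None
        level = stamp.group(2) if stamp else None

        event = ZK_EVENT.search(text)
        if event:
            # ZooKeeper session state change (Disconnected, Expired, ...)
            hits.append((when, "zk-" + event.group(1), text))
        elif "slept" in text:
            # Assumed marker for long-pause warnings; adjust to the real wording.
            hits.append((when, "pause", text))
        elif level in ("ERROR", "FATAL"):
            hits.append((when, level, text))
    return hits
```

Calling summarize(open(path)) on each region server's log and printing the tuples in order should give a compact timeline of when the session was lost relative to any pauses, which is usually easier to post than the full file.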
