Hi,
I am repeatedly running into an issue where all the region servers try to
restart but fail to come up. All of them seem to hit the same kind of
exception, which causes this state.
My cluster is as follows:
node1: Master, NN, DN, RS, TT, XX
node2: Zookeeper, JT, DN, RS, TT, XX
node3: DN, RS, TT, XX
where XX is my own HBase client with around 150 threads writing to a common
table.
The setup works fine for some time and then goes down (after 20 to 30 minutes).
Here is the sequence in the region server logs:
* RS gets a ZooKeeper event: Got ZooKeeper event, state: Disconnected,
type: None, path: null
* RS retries 'Processing message' and gets a LeaseStillHeldException:
2009-06-28 02:14:17,013 WARN org.apache.hadoop.hbase.regionserver.HRegionServer:
Processing message (Retry: 1)
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
* After 10 retries, it gets another ZooKeeper event: Got ZooKeeper
event, state: Expired, type: None, path: null
2009-06-28 02:14:17,751 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
2009-06-28 02:14:17,751 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Restarting Region Server
* It decides to restart the region server, but logs errors like this:
2009-06-28 02:14:17,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 280 on 60020, call exists([...@75880048,
row=724b330295375ad0ba68fa85325381, maxVersions=1,
timeRange=[0,9223372036854775807), families=ALL) from 69.147.127.248:48945:
error: java.io.IOException: Server not running, aborting
* The above might be happening because client XX is still trying to write?
Finally it closes the region server and tries to restart, but gets the
following exception:
2009-06-28 02:14:26,462 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread.
2009-06-28 02:14:26,462 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Runs every 10000000ms
2009-06-28 02:14:26,462 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete
2009-06-28 02:14:27,032 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: Failed init
java.lang.NullPointerException
at
org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:431)
at java.lang.Thread.run(Thread.java:619)
2009-06-28 02:14:27,110 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: Unhandled exception.
Aborting...
java.io.IOException: Region server startup failed
at
org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:832)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:751)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:431)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
at
org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
... 2 more
2009-06-28 02:14:27,122 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
request=0.0, regions=9, stores=10, storefiles=20, storefileIndexSize=0,
memcacheSize=52, usedHeap=170, maxHeap=1995,
blockCacheSize=49971560, blockCacheFree=28440, blockCacheCount=765,
blockCacheHitRatio=94
2009-06-28 02:14:27,131 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server
on 60020
2009-06-28 02:14:27,131 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
2009-06-28 02:14:27,131 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: On abort, closed hlog
2009-06-28 02:14:27,136 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at:
0.0.0.0:60020
The region server dies after that. All 3 region servers die like this and
I have to start them manually. But after 10-15 minutes, they run into the
same state again. Please help me find the root cause of this.
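For anyone checking their own logs for the same pattern, here is a minimal
sketch (plain Python, not part of HBase) that lists the ZooKeeper session
states in the order they occur. The function name zookeeper_states and the
sample text are illustrative assumptions; the regular expression is based only
on the event lines quoted above, so other HBase versions may word them
differently.

```python
import re

# Matches the event text as it appears in the log excerpts quoted above.
ZK_EVENT = re.compile(r"Got ZooKeeper event, state: (\w+), type: (\w+)")


def zookeeper_states(log_text):
    """Return the ZooKeeper session states in the order they appear."""
    # Collapse all whitespace first, so an event that was wrapped across
    # two lines (as in the mail above) still matches the pattern.
    flattened = " ".join(log_text.split())
    return [match.group(1) for match in ZK_EVENT.finditer(flattened)]


# Hypothetical sample built from the two event lines quoted in this mail.
sample = """\
Got ZooKeeper event, state: Disconnected,
type: None, path:
null
... Got ZooKeeper
event, state: Expired, type: None, path: null
"""

print(zookeeper_states(sample))  # ['Disconnected', 'Expired']
```

A Disconnected state followed by Expired is the sequence described in this
mail, after which the region server logs "ZooKeeper session expired" and
restarts.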
Thanks,
Murali Krishna