Thanks, Stack. We will move to 20.3 or 20 trunk very soon. More responses inline below.
On Wed, Apr 7, 2010 at 8:52 PM, Stack <st...@duboce.net> wrote:
> On Wed, Apr 7, 2010 at 7:49 PM, Raghu Angadi <rang...@apache.org> wrote:
> > We are working with a small HBase cluster (5 nodes) with fairly beefy
> > nodes. While looking at why all the regionservers died at one time,
> > noticed that these servers read some files 100s of times a second.
> > This may not be cause of the error... but do you think this is odd?
>
> Check end of regionserver log. Should say why RegionServer went away.
> The usual reason is long GC pause, one that is longer than zk session
> timeout.

This seems to be the case. There were CMS GC failures (promotion failed,
Full GC, etc.): 4-5 pauses of about 4-10 seconds over a minute or so. Is
that enough to kill a ZK session? We are increasing the memory and will
go through the tuning tips on the wiki.

There are various other errors in the log over the couple of hours the RS
was running. I will post a link to the full log.

--- failure on RS-72 ---
2010-04-06 22:11:07,668 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x127d58da4e70002 to sun.nio.ch.selectionkeyi...@426295eb
java.io.IOException: TIMED OUT
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-04-06 22:11:07,668 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x27d58da6de0088 to sun.nio.ch.selectionkeyi...@283f4633
java.io.IOException: TIMED OUT
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-04-06 22:11:07,672 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 27 on 60020, call put([...@20a192c7,
[Lorg.apache.hadoop.hbase.client.Put;@4fab578d) from 10.10.0.72:60211:
error: java.io.IOException: Server not running, aborting
java.io.IOException: Server not running, aborting
    at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2345)
------------

--- failure on RS-73 after a few minutes ---
2010-04-06 22:21:41,867 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-4957903368956265878 lease expired
2010-04-06 22:21:47,806 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x127d58da4e7002a to sun.nio.ch.selectionkeyi...@15ef1241
java.io.IOException: TIMED OUT
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-04-06 22:21:47,806 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer:
java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
-------

[...]

> > 2010-04-06 21:51:43,328 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> > obtain block blk_-7610953303919156937_1089667 from any node:
> > java.io.IOException: No live nodes contain current block
> > [...]
> > ----
>
> Are you accessing from mapreduce? If so, does your hadoop have hdfs-127?
>
> Then there are the usual suspects. Xceivers count -- up it to 2k or
> so -- and ulimit should be much greater than the default 1024.

Yes, most of the traffic now is puts from reducers. I think our HDFS is a
recent Cloudera release; I will check. Most likely it won't have HDFS-127.

Yup, we hit the xceivers limit very early. The limit is now 2k and the fd
limit is also high.

[...]

> > There are thousands of repeated reads of many small files like this.
> >
> > --- From NN log, this block was created
> > for /hbase/.META./1028785192/info/1728561479703335912
> > 2010-04-06 21:51:20,906 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> > NameSystem.allocateBlock: /hbase/.META./1028785192/info/1728561479703335912.
> > blk_8972126557191254374_1090962
> > ----
> >
> > Btw, we had single replication set for this file by mistake.
>
> So, if any error getting a block, there is no recourse. Was there a
> concurrent process sucking i/o from HDFS running at the same time?
> Writing, clients need to figure where to write. They'll do this by
> doing lookup in .META. They'll then cache the info.
> If clients are short-lived, then lots of .META. hits.

Client here is the HBase client (in our case, the reducers)?

> And as Ryan says, what's the caching stats look like for the .META.
> region? (See server it was hosted on and check its logs -- we dump
> cache metrics every minute or so.)

What should I look for in the log for these stats? I didn't notice any
stats. I will send a link to the full log.

Raghu.

> St.Ack
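To illustrate Stack's point about short-lived clients causing lots of .META. hits, here is a toy sketch of client-side location caching. This is NOT HBase's actual client code; the class and names are invented for illustration only. The idea is that a long-lived client pays the .META. lookup once per key and then serves later lookups from its local cache, while a client created fresh for every task pays the lookup every time:

```python
# Toy model of client-side region-location caching (hypothetical names,
# not the HBase client API): a lookup function stands in for a read of
# the .META. region, and a dict stands in for the client's location cache.
class RegionLocator:
    def __init__(self, meta_lookup):
        self._meta_lookup = meta_lookup  # function: row key -> server name
        self._cache = {}                 # local cache: row key -> server
        self.meta_hits = 0               # how often we had to consult .META.

    def locate(self, row_key):
        # Serve from the cache when possible; otherwise do the expensive
        # .META. lookup and remember the answer.
        if row_key not in self._cache:
            self.meta_hits += 1
            self._cache[row_key] = self._meta_lookup(row_key)
        return self._cache[row_key]

# A long-lived client resolves a given key against .META. only once:
locator = RegionLocator(lambda key: "regionserver-1")
for _ in range(100):
    locator.locate("row-42")
print(locator.meta_hits)  # 1 -- cached after the first lookup

# Short-lived clients (a fresh locator per task) pay the cost every time:
total = 0
for _ in range(100):
    fresh = RegionLocator(lambda key: "regionserver-1")
    fresh.locate("row-42")
    total += fresh.meta_hits
print(total)  # 100 -- every task does its own .META. lookup
```

This is why reusing one client across many puts (rather than constructing one per reducer invocation) keeps the load on the .META. region down.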