Thanks Stack. We'll move to 0.20.3 or the 0.20 trunk very soon. More
responses inline below.

On Wed, Apr 7, 2010 at 8:52 PM, Stack <st...@duboce.net> wrote:

> On Wed, Apr 7, 2010 at 7:49 PM, Raghu Angadi <rang...@apache.org> wrote:
> > We are working with a small HBase cluster (5 nodes) with fairly beefy
> nodes.
> > While looking at why all the regionservers died at one time, noticed that
> > these servers read some files 100s of times a second. This may not be
> cause
> > of the error... but do you think this is odd?
> >
> Check end of regionserver log.  Should say why RegionServer went away.
>  The usual reason is long GC pause, one that is longer than zk session
> timeout.


This seems to be the case... there were CMS GC failures ("promotion
failed", full GC, etc.), with 4-5 pauses of about 4-10 seconds each over a
minute or so. Is that enough to kill the ZK session? We are increasing the
heap and will go through the tuning tips on the wiki.
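
For the archives, this is roughly what we plan to try. A sketch only: the
values are guesses for our workload, and I am assuming the session timeout
is still governed by the zookeeper.session.timeout property in this
version.

--- hbase-site.xml ---
<property>
  <name>zookeeper.session.timeout</name>
  <!-- value in ms; a GC pause longer than this expires the RS session -->
  <value>60000</value>
</property>
------

--- hbase-env.sh ---
# Start CMS collections earlier so "promotion failed" full GCs are rarer.
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled"
------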

There are various other errors in the log over the couple of hours the RS
ran. I will post a link to the full log.

--- failure on RS-72 ---
2010-04-06 22:11:07,668 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x127d58da4e70002 to sun.nio.ch.SelectionKeyImpl@426295eb
java.io.IOException: TIMED OUT
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-04-06 22:11:07,668 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x27d58da6de0088 to sun.nio.ch.SelectionKeyImpl@283f4633
java.io.IOException: TIMED OUT
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-04-06 22:11:07,672 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 27 on 60020, call put([...@20a192c7,
[Lorg.apache.hadoop.hbase.client.Put;@4fab578d) from 10.10.0.72:60211:
error: java.io.IOException: Server not running, aborting
java.io.IOException: Server not running, aborting
at
org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2345)
------------

--- failure on RS-73 after a few minutes ---

2010-04-06 22:21:41,867 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-4957903368956265878 lease expired
2010-04-06 22:21:47,806 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x127d58da4e7002a to sun.nio.ch.SelectionKeyImpl@15ef1241
java.io.IOException: TIMED OUT
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-04-06 22:21:47,806 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer:
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
-------

 [...]

> > 2010-04-06 21:51:43,328 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> > obtain block blk_-7610953303919156937_1089667 from any node:
> >  java.io.IOException: No live nodes contain current block
> > [...]
> > ----
>
> Are you accessing from mapreduce?  If so, does your hadoop have HDFS-127?
>
> Then there are the usual suspects.  Xceivers count -- up it to 2k or
> so -- and ulimit should be much greater than the default 1024.


Yes, most of the traffic now is puts from reducers. I think our HDFS is a
recent Cloudera release; I will check. Most likely it won't have HDFS-127.

Yup, we hit the xceivers limit very early on. The limit is set to 2k, and
the fd limit is also high.
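
For anyone searching the archives later, the two knobs look roughly like
this. The misspelled "xcievers" below is, as far as I know, the actual
property name in this era of HDFS:

--- hdfs-site.xml ---
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2048</value>
</property>
------

--- /etc/security/limits.conf (assuming the daemons run as user "hadoop") ---
hadoop  -  nofile  32768
------

The datanodes need a restart to pick up the xceivers change, and the fd
limit has to be raised for the user that actually runs the daemons.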

[...]

> > There are thousands of repeated reads of many small files like this.
> >
> > --- From NN log, this block was created
> > for /hbase/.META./1028785192/info/1728561479703335912
> > 2010-04-06 21:51:20,906 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> > NameSystem.allocateBlock:
> /hbase/.META./1028785192/info/1728561479703335912.
> > blk_8972126557191254374_1090962
> > ----
> >
> > Btw, we had single replication set for this file by mistake.
> >
>
> So, if there is any error getting a block, there is no recourse.  Were
> there concurrent processes sucking i/o from HDFS running at the same time?
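
(For the archives: replication on existing files can be raised after the
fact with the dfs shell, e.g. "hadoop fs -setrep -R 3 /hbase" -- the path
is illustrative, and -w can be added to wait for replication to finish.)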



> When writing, clients need to figure out where to write.  They'll do this
> by doing a lookup in .META.  They'll then cache the info.  If clients are
> short-lived, that means lots of .META. hits.
>

The client here is the HBase client (in our case, the reducers)?
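
To make sure I follow: if each reducer holds one HTable for the task's
lifetime, region locations get cached after the first .META. lookup,
roughly as in the sketch below (table and family names are made up, and
this is against the 0.20 client API as I understand it). A new HTable with
a fresh configuration per record, on the other hand, would re-hit .META.
constantly, which could explain the read rates we saw. I will check which
pattern our job uses.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PutReducer extends Reducer<Text, Text, Text, Text> {
  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    // One HTable per task, not per reduce() call: after the first lookup
    // the client caches region locations from .META., so a long-lived
    // client mostly stops hitting .META.
    table = new HTable(new HBaseConfiguration(), "mytable");
    table.setAutoFlush(false);                  // buffer puts client-side
    table.setWriteBufferSize(4 * 1024 * 1024);  // flush in ~4MB batches
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException {
    for (Text v : values) {
      Put put = new Put(Bytes.toBytes(key.toString()));
      put.add(Bytes.toBytes("info"), Bytes.toBytes("col"),
          Bytes.toBytes(v.toString()));
      table.put(put);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.flushCommits();  // push whatever is still buffered
  }
}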


> And as Ryan says, what do the caching stats look like for the .META.
> region?  (See which server it was hosted on and check its logs -- we dump
> cache metrics every minute or so.)
>

What should I look for in the log for these stats? I didn't notice any
stats. I will send a link to the full log.
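
(I will start with something like "grep LruBlockCache" over the
regionserver log, assuming the periodic cache stats in this version are
logged by the LruBlockCache class.)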

Raghu.


> St.Ack
>
