If you haven't upped the ulimit for file descriptors and you have more
than a handful of regions in your cluster, you start to experience
'weirdness'. See http://wiki.apache.org/hadoop/Hbase/FAQ#6. You might
not see the 'too many open files' message in your hbase logs (it might be
in your datanode logs instead), but the symptoms of not-enough-fds vary;
an OOME is one of them.
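If you want to see what limit the regionserver JVM actually ended up with,
one quick way -- a sketch only, and it assumes a Sun JVM where the OS bean
is a com.sun.management.UnixOperatingSystemMXBean -- is to ask JMX from a
java process started the same way as the regionserver:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdCheck {
  public static void main(String[] args) {
    OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
    if (os instanceof UnixOperatingSystemMXBean) {
      UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
      // Max fds the process may hold (the ulimit) vs. what it holds now.
      System.out.println("max fds:  " + unixOs.getMaxFileDescriptorCount());
      System.out.println("open fds: " + unixOs.getOpenFileDescriptorCount());
    } else {
      System.out.println("Not a UNIX JVM; can't read fd counts this way.");
    }
  }
}

If 'open fds' creeps up toward 'max fds' while your upload is running,
that is likely your problem.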
St.Ack
Slava Gorelik wrote:
Hi.
I'll send the log a little bit later, along with answers to all your
questions, but what do you mean by "You have upped your file descriptors?"
Best Regards.
On Wed, Oct 8, 2008 at 11:41 PM, stack <[EMAIL PROTECTED]> wrote:
Do you have DEBUG enabled? Can I see the log from the regionserver that went
down? Can you tell me more about your cluster? Number of nodes, number of
regions? What does your uploader look like (is it an MR job)? Have you upped
your file descriptors?
Thanks Slava.
St.Ack
Slava Gorelik wrote:
Hi. I'm also encountering an error like this.
I'm using HBase 0.18.0 and Hadoop 0.18.0.
In addition to this error, region servers sometimes die: in the log I see a
region server shutdown after a compaction starts, because some data blocks
are not found.
Best Regards.
On Wed, Oct 8, 2008 at 11:29 PM, stack <[EMAIL PROTECTED]> wrote:
You should update to 0.2.1 if you can. Make sure you've upped your file
descriptors too: see http://wiki.apache.org/hadoop/Hbase/FAQ#6. Also see
how to enable DEBUG in the same FAQ.
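For the servers you'll want the log4j.properties route from the FAQ, but if
you just want DEBUG in your own client or test JVM, you can also flip it
programmatically -- a minimal sketch using plain log4j 1.x calls (nothing
hbase-specific about it; the logger name is just the package):

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class EnableHBaseDebug {
  public static void main(String[] args) {
    // Flip every logger under the hbase package to DEBUG in this JVM.
    Logger.getLogger("org.apache.hadoop.hbase").setLevel(Level.DEBUG);
    System.out.println("hbase loggers now at "
        + Logger.getLogger("org.apache.hadoop.hbase").getLevel());
  }
}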
Something odd is up when you see messages like this out of HDFS: 'No live
nodes contain current block'. That's lost data.
Or messages like this, 'compaction completed on region
search1,r3_1_3_c157476,1223360357528 in 18mins, 39sec' -- i.e. that
compactions are taking so long -- would seem to indicate your machines
are severely overloaded or underpowered or both. Can you study load while
the upload is running on these machines? Perhaps try throttling back to
see if hbase survives longer?
The regionserver will output a thread dump from its RPC layer on a critical
error -- an OOME -- or if it has been hung up for a long time, IIRC.
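(That listing -- the '35 active threads' block in Ray's mail below -- looks
like the output of Hadoop's ReflectionUtils thread-dump helper, which the
server calls when it thinks it is wedged. If I'm remembering the
0.17/0.18-era signature right, printThreadInfo(PrintWriter, String), you
can produce the same kind of dump yourself:

import java.io.PrintWriter;
import org.apache.hadoop.util.ReflectionUtils;

public class DumpThreads {
  public static void main(String[] args) {
    // Emits "<N> active threads" plus per-thread state, blocked/waited
    // counts and a stack -- the same shape as the listing in the log.
    ReflectionUtils.printThreadInfo(new PrintWriter(System.out, true),
        "manual dump");
  }
}
)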
Check the '.out' logs for your hbase install too, to see if they contain
any errors. Grep the datanode logs as well for OOME or "too many open file
handles".
St.Ack
Rui Xing wrote:
Hi All,
1). We are doing performance testing on HBase. The test environment is 3
data nodes and 1 name node, distributed over 4 machines. We started one
region server on each data node. To insert the data, one insertion client
is started on each data node machine (a rough sketch of what such a client
looks like follows the log excerpt below). But as the data was inserted,
the region servers crashed one by one. One of the failure reasons is shown
in the log that follows:
===>
2008-10-07 14:47:01,519 WARN org.apache.hadoop.dfs.DFSClient: Exception while reading from blk_-806310822584979460 of /hbase/search1/1201761134/col9/mapfiles/3578469984425427480/data from 10.2.6.102:50010: java.io.IOException: Premeture EOF from inputStream
... ...
2008-10-07 14:47:01,521 INFO org.apache.hadoop.dfs.DFSClient: Could not obtain block blk_-806310822584979460 from any node: java.io.IOException: No live nodes contain current block
2008-10-07 14:52:25,229 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region search1,r3_1_3_c157476,1223360357528 in 18mins, 39sec
2008-10-07 14:52:25,238 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver/0.0.0.0:60020.compactor exiting
2008-10-07 14:52:25,284 INFO org.apache.hadoop.hbase.regionserver.HRegion: closed search1,r3_1_3_c157476,1223360357528
2008-10-07 14:52:25,291 INFO org.apache.hadoop.hbase.regionserver.HRegion: closed -ROOT-,,0
2008-10-07 14:52:25,291 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: 10.2.6.104:60020
2008-10-07 14:52:25,291 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/0.0.0.0:60020 exiting
2008-10-07 14:52:25,511 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread.
2008-10-07 14:52:25,511 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete
===<
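(For context, an insertion client of the kind described above would, in its
simplest 0.2-era form, be just a BatchUpdate write loop like the sketch
below. This is an illustration, not our actual code -- the row keys, column
qualifier, and values are placeholders; only the table name "search1" and
family "col9" are taken from the HDFS paths in the log above, and I am going
from memory of the 0.2 client API.)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

public class InsertClient {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    // Table name and column family lifted from the log paths above;
    // everything else below is made up for illustration.
    HTable table = new HTable(conf, "search1");
    for (long i = 0; i < 10000000L; i++) {
      BatchUpdate update = new BatchUpdate("row_" + i);
      update.put("col9:content", ("value_" + i).getBytes());
      table.commit(update);
    }
  }
}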
2). Another question: under what circumstances will the region server print
thread information like the listing below? It appears among the normal log
records.
===>
35 active threads
Thread 1281 (IPC Client connection to d3v1.corp.alimama.com/10.2.6.101:54310):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    java.util.Hashtable.remove(Hashtable.java:435)
    org.apache.hadoop.ipc.Client$Connection.run(Client.java:297)
... ...
===<
We use Hadoop 0.17.1 and HBase 0.2.0. It would be greatly appreciated if
you could drop us any clues.
Regards,
-Ray