Are there updates happening in your MR job?

If so, the slowness might be caused by memcache flushing and compaction.
With that many regions on so few servers, compaction would take a while to
run on all the regions, and if it's time for a major compaction then you
are looking at a lot of CPU/disk/network work.

Guessing that if the splits are set for 256MB, your average region should be close to 128MB or so; 128MB * 481 = 60.125GB of data to compact. That's a lot of data for one server to compact.

If you are seeing a lot of compactions happening in the logs with debug on,
you might try editing the compaction config to let them happen less often,
for example as sketched below.
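
A rough sketch of what I mean, assuming hbase-site.xml (the values are only
examples, not tuned recommendations; if I remember right the defaults are 3
store files to trigger a minor compaction and one major compaction per day):

  <!-- hbase-site.xml: compact less eagerly. Example values only. -->
  <property>
    <name>hbase.hstore.compactionThreshold</name>
    <value>5</value>
  </property>
  <property>
    <name>hbase.hregion.majorcompaction</name>
    <value>172800000</value>  <!-- 2 days in ms; default is 86400000 -->
  </property>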

Also, just a question: what are the specs of the servers hosting this? Number of cores and GHz speed, total memory, and disk speed and type (5400/7200/15000 RPM; IDE or SCSI)?

Billy




"Genady " <[email protected]> wrote in message news:0a2701c98195$3533b0a0$9f9b11...@com...
St.Ack,

Please see my answers below:

-----Original Message-----
From: stack [mailto:[email protected]]
Sent: Wednesday, January 28, 2009 9:43 PM
To: [email protected]
Cc: [email protected]
Subject: Re: Hbase 0.19 failed to start: exceeds the limit of concurrent
xcievers 3000



Genady wrote:

Thanks for your answer Jean-Adrien,

I've verified setting the timeout parameter back to the default value and
xceivers to the original 3000 (too small for the number of regions in our
environment). After a while HBase indeed succeeded to start (with tons of
"exceeds xceiver limit" exceptions); nevertheless, performance of the MR
task remains too slow. As Jean-Daniel suggested in a previous post, this is
probably a result of too many regions per region server, so we are going to
increase the region file size and rebuild the data.
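
(A minimal sketch of that change, assuming hbase-site.xml and a purely
illustrative 1GB value; the default split size is 256MB:)

  <!-- hbase-site.xml: raise the region split size from the 256MB default.
       1GB (1073741824 bytes) is only an example value. -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>1073741824</value>
  </property>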



Leaving the default means that fewer resources are concurrently occupied
in the datanode -- the sockets and threads of under-utilized files are let
go (you'll see the timeout exception in your log when the let-go happens).
Resources are maximally used at startup when all the region opens are
happening. You might even consider setting the default timeout down from 8
minutes to something like 2 or 4 if you run into max xceivers again; see
the sketch below.
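
A sketch of that change, assuming hadoop-site.xml and a 4-minute value
chosen purely for illustration:

  <!-- hadoop-site.xml: drop the datanode write timeout from its
       480000 ms (8 minute) default to 4 minutes. -->
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>240000</value>
  </property>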



Tell us more about your slowness before you go about changing region
sizes. How is it slow? Is it lookups against the .META. table? Try some
yourself in the shell to see how well those are doing. See if you can
narrow down why it's slow. Are you swapping (as J-D asked earlier)? How
long does the MR job run? Is it slow over its whole life? Are your tasks
short? If so, you might make them run longer so you better exploit the
cache of region locations built by a client. How many mappers do you have
running concurrently? If many, try cutting them in half; see the sketch
below.
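
One way to do that, if the knob you want is the per-tasktracker map slot
count -- a sketch assuming hadoop-site.xml, with an example value only:

  <!-- hadoop-site.xml, on each tasktracker: cap concurrent map tasks.
       Set this to roughly half of whatever you run now; 2 is only an
       example (it also happens to be the shipped default). -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>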





Gennady: As soon as HBase is up, even copyFromLocal to Hadoop DFS works
about ten times slower. My task has 10-20M records and normally takes
about 10 minutes; now it takes about 1 hour. No swapping was noticed on
any of the servers, and the MR tasks are slow all the time. Strangest of
all, nothing can be seen in the logs (debug is on), only higher than usual
CPU rates (~90%) on the region servers and datanodes. Besides cutting down
the thread stack size, is there anything else to try?





Thanks,

Gennady





Regarding your question about JVM errors, according to the following post
it seems that for this OOM error ("java.lang.OutOfMemoryError: unable to
create new native thread"), increasing the heap size will not prevent the
problem:

http://www.egilh.com/blog/archive/2006/06/09/2811.aspx







Yes, it's a complaint about resources outside of the JVM heap. Upping the
heap size won't help. You could try playing with -Xss -- the thread stack
size -- lowering it from whatever the java6 default is to see if that
helps.



St.Ack







Anyway, after setting the Hadoop heap size to 1 or 1.5GB the error
(probably a result of increasing the xceivers thread number) didn't come
back.

Gennady

-----Original Message-----
From: Jean-Adrien [mailto:[email protected]]
Sent: Wednesday, January 28, 2009 6:03 PM
To: [email protected]
Subject: Re: Hbase 0.19 failed to start: exceeds the limit of concurrent
xcievers 3000

Hello Genady,

You might be interested in one of our previous posts about this topic:

http://www.nabble.com/Datanode-Xceivers-td21372227.html


If you are using Hadoop / HBase 0.19, you should leave the timeout
dfs.datanode.socket.write.timeout at its original default value of 480000
(8 min). Stack tested this, and the effect is that the Xcievers threads of
Hadoop eventually end with errors, but the errors do not affect HBase
stability since HADOOP-3831 has been fixed for 0.19. It should also
decrease the number of threads, and therefore the memory needed for the
JVM process.

Personally, I haven't updated to 0.19 yet, so I haven't tested this
myself, but I can't wait...

One thing I don't understand in your problem is that the memory allocated
per thread in the JVM is not heap, but stack. Anyway, the global virtual
memory allocated to the process should decrease (which allows you to
increase the heap).

For your information, I run 3 region servers with a 512MB heap and about
150 regions each. I am seeing my first OOMs these days.

About Xcievers, I see peaks of 1300 Xcievers during HBase startup with 2
datanodes and a replication factor of 2; but if I enable the timeout, I
guess about 800 should be enough.

Genady wrote:

Hi,

It seems that HBase 0.19 on Hadoop 0.19 fails to start because it exceeds
the limit of concurrent xceivers (seen in the Hadoop datanode logs), which
is currently 3000; setting more than 3000 xceivers causes a JVM
out-of-memory exception. Is there something wrong with the configuration
parameters of the cluster (three nodes, 430 regions, Hadoop heap size is
the default 1GB)?

Additional parameters in hbase configuration are:

dfs.datanode.handler.count = 6
dfs.datanode.socket.write.timeout = 0
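
(For reference, a sketch of how those parameters, together with the xceiver
ceiling dfs.datanode.max.xcievers, might look in the datanode's
hadoop-site.xml, using the values mentioned in this thread:)

  <!-- hadoop-site.xml: datanode xceiver ceiling plus the two parameters
       listed above (values as reported in this thread). -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>3000</value>
  </property>
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>6</value>
  </property>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>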

java.io.IOException: xceiverCount 3001 exceeds the limit of concurrent
xcievers 3000
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:87)
        at java.lang.Thread.run(Thread.java:619)

Any help is very appreciated,

Genady