On Sat, Feb 20, 2010 at 2:27 PM, Rod Cope <rod.c...@openlogic.com> wrote:
> Thanks for the quick help, J-D.  Answers in-line.
>
>> 1. Just to make sure, check out the region server log and grep for
>> ulimit as it prints it out when it starts.
>
> Fri Feb 19 13:56:34 MST 2010 Starting regionserver on dd08
> ulimit -n 32768

Good.
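For the archives, the usual place to set that is /etc/security/limits.conf
for the user that runs the daemons. A minimal sketch, assuming a "hadoop"
user (the username is an assumption, use whatever your cluster runs as):

  # /etc/security/limits.conf
  # raises the open-file cap for the daemon user; "hadoop" is a placeholder
  hadoop  -  nofile  32768

The new limit only applies to sessions started after the change, so restart
the daemons from a fresh login for it to take effect.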
>> 2. Did you give HBase more heap? (that won't fix your problem, just
>> making sure)
>
> # The maximum amount of heap to use, in MB. Default is 1000.
> export HBASE_HEAPSIZE=4000
>
> All boxes have 32GB RAM.

Awesome.

>> 3. No need to run 9 ZK quorum members, 3 is fine or even 1 fwiw in your
>> case.
>
> Thanks for the suggestion - I'll go with 3.  I was planning to run 5, but
> hadn't gotten around to running a separate ZK quorum (they're managed by
> HBase right now).  Would having 9 cause problems?  It doesn't seem to be
> making trouble at this point.

With 9, some write operations might take longer (search for "Patrick Hunt"
on this list and you will find links to graphs showing ops latency vs the
number of nodes). At StumbleUpon we run with quorums of 5, only 1 per
datacenter.
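Since HBase is managing ZK for you, going from 9 down to 3 is just a shorter
host list. A minimal sketch for hbase-site.xml (the hostnames are made up,
pick 3 of your 9 boxes):

  <property>
    <name>hbase.zookeeper.quorum</name>
    <!-- placeholder hostnames; list 3 of your own machines -->
    <value>dd01,dd02,dd03</value>
  </property>

Keep the list identical on every node and client, and stick with an odd
number of members so the quorum always has a clear majority.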
>> 4. How many CPUs do you have per box? How many tasks are running at
>> the same time?
>
> Boxes are each dual quad-core or dual hex-core with 6 hard drives and
> they're all dedicated to Hadoop-related work.  The one with the most
> recent failure is a dual hex-core.  It was running 7 map jobs (no
> reduces), ZK, datanode, tasktracker, regionserver, Stargate, and Tomcat
> w/2 instances of Solr.  It's not using any swap (thanks to 32GB RAM).
> As far as I can tell, it was never under too much stress in terms of
> CPU, memory, or disk - not sure about network bandwidth, although the 9
> boxes are on a dedicated Gbit switch.

With the answer you gave to 5, I would not expect much stress.

>> 5. Did you grep the datanode logs for exceptions?
>
> This is all for the box with the bad regionserver on the latest load:
>
> I found hundreds of these, the first of which is about 1 hour into the
> load:
>
> 2010-02-19 20:43:12,386 WARN
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(192.168.60.108:50010,
> storageID=DS-1345361456-192.168.60.108-50010-1266525372855,
> infoPort=50075, ipcPort=50020):Got exception while serving
> blk_5418714232323545866_193910 to /192.168.60.101:
> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> for channel to be ready for write. ch :
> java.nio.channels.SocketChannel[connected local=/192.168.60.108:50010
> remote=/192.168.60.101:47012]
>
> A handful of these:
>
> 2010-02-19 15:28:41,481 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_4631267981694470746_202618 src: /192.168.60.106:52238 dest:
> /192.168.60.108:50010
> 2010-02-19 15:28:41,482 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> blk_4631267981694470746_202618 received exception
> org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException:
> Block blk_4631267981694470746_202618 is valid, and cannot be written to.
> 2010-02-19 15:28:41,482 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(192.168.60.108:50010,
> storageID=DS-1345361456-192.168.60.108-50010-1266525372855,
> infoPort=50075, ipcPort=50020):DataXceiver
> org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException:
> Block blk_4631267981694470746_202618 is valid, and cannot be written to.
>
> And 2 of these within 2 seconds of each other, about an hour before the
> regionserver started flaking out:
>
> 2010-02-19 20:42:25,252 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(192.168.60.108:50010,
> storageID=DS-1345361456-192.168.60.108-50010-1266525372855,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: xceiverCount 1025 exceeds the limit of concurrent
> xcievers 1024
>
> Looks like I need to bump up the number of xcievers.  Any thoughts on
> the SocketTimeoutExceptions or BlockAlreadyExistsExceptions?

From experience, these 2 issues are symptoms of either too few file handles
or too few xceivers. Have a look at
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A5 for how to configure
it and for some background information.

>
> Thanks,
> Rod
>
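For reference, a minimal sketch of that xciever bump for hdfs-site.xml on
the datanodes; note that the misspelled property name is Hadoop's own, and
4096 is just a commonly used value rather than a hard recommendation (see
the wiki page above):

  <property>
    <!-- the misspelling is Hadoop's, not a typo here -->
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>

Restart the datanodes after changing it, and raise it together with the
ulimit, since each xceiver thread holds open files and sockets.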