Thanks for the quick help, J-D.  Answers in-line.

> 1. Just to make sure, check out the region server log and grep for
> ulimit as it prints it out when it starts.
Fri Feb 19 13:56:34 MST 2010 Starting regionserver on dd08
ulimit -n 32768

> 2. Did you give HBase more heap? (that won't fix your problem, just making
> sure)

# The maximum amount of heap to use, in MB. Default is 1000.
export HBASE_HEAPSIZE=4000

All boxes have 32GB RAM.

> 3. No need to run 9 ZK quorum members, 3 is fine or even 1 fwiw in your case.

Thanks for the suggestion - I'll go with 3.  I was planning to run 5, but
hadn't gotten around to setting up a separate ZK quorum (they're managed by
HBase right now).  Would having 9 cause problems?  It doesn't seem to be
causing any trouble at this point.

> 4. How many CPUs do you have per box? How many tasks are running at
> the same time?

Boxes are each dual quad-core or dual hex-core with 6 hard drives, and they're
all dedicated to Hadoop-related work.  The one with the most recent failure is
a dual hex-core.  It was running 7 map tasks (no reduces), ZK, a datanode, a
tasktracker, a regionserver, Stargate, and Tomcat with 2 instances of Solr.
It's not using any swap (thanks to 32GB RAM).  As far as I can tell, it was
never under too much stress in terms of CPU, memory, or disk - I'm not sure
about network bandwidth, although the 9 boxes are on a dedicated Gbit switch.

> 5. Did you grep the datanode logs for exceptions?

This is all from the box with the bad regionserver on the latest load.

I found hundreds of these, the first of which is about 1 hour into the load:

2010-02-19 20:43:12,386 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(192.168.60.108:50010,
storageID=DS-1345361456-192.168.60.108-50010-1266525372855, infoPort=50075,
ipcPort=50020):Got exception while serving blk_5418714232323545866_193910 to
/192.168.60.101:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for
channel to be ready for write. ch : java.nio.channels.SocketChannel[connected
local=/192.168.60.108:50010 remote=/192.168.60.101:47012]

A handful of these:

2010-02-19 15:28:41,481 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Receiving block blk_4631267981694470746_202618 src: /192.168.60.106:52238
dest: /192.168.60.108:50010
2010-02-19 15:28:41,482 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
writeBlock blk_4631267981694470746_202618 received exception
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block
blk_4631267981694470746_202618 is valid, and cannot be written to.
2010-02-19 15:28:41,482 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(192.168.60.108:50010,
storageID=DS-1345361456-192.168.60.108-50010-1266525372855, infoPort=50075,
ipcPort=50020):DataXceiver
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block
blk_4631267981694470746_202618 is valid, and cannot be written to.

And 2 of these within 2 seconds of each other, about an hour before the
regionserver started flaking out:

2010-02-19 20:42:25,252 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(192.168.60.108:50010,
storageID=DS-1345361456-192.168.60.108-50010-1266525372855, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.IOException: xceiverCount 1025 exceeds the limit of concurrent
xcievers 1024

Looks like I need to bump up the number of xcievers.  Any thoughts on the
SocketTimeoutExceptions or BlockAlreadyExistsExceptions?
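For reference, here's the bump I'm planning to make in hdfs-site.xml on each
datanode and then restart them (4096 is only a guess at a more comfortable
ceiling for our load, not something I've verified).  My understanding is that
every block being read or written on a datanode ties up one DataXceiver
thread, which is why heavy compaction/flush traffic blows past the 1024
default:

  <property>
    <!-- Max concurrent DataXceiver threads per datanode
         (note the property name really is misspelled in Hadoop) -->
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>

If anyone has a better value for a 9-node cluster under this kind of load
pattern, I'm all ears.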
Thanks,
Rod

> On Sat, Feb 20, 2010 at 1:05 PM, Rod Cope <rod.c...@openlogic.com> wrote:
>> Just did this on every box and got pretty much the same answers on each:
>>
>> [had...@dd08 logs]$ ulimit -n; /usr/sbin/lsof | wc -l
>> 32768
>> 3589
>>
>> Any other thoughts?
>>
>> Rod
>>
>> On 2/20/10 Saturday, February 20, 2010 1:58 PM, "Dan Washusen"
>> <d...@reactive.org> wrote:
>>
>>> Sorry I can't be more helpful, but just to double check it's not a file
>>> limits issue, could you run the following on each of the hosts:
>>>
>>> $ ulimit -a
>>> $ lsof | wc -l
>>>
>>> The first command will show you (among other things) the file limits; it
>>> should be above the default 1024.  The second will tell you how many
>>> files are currently open...
>>>
>>> Cheers,
>>> Dan
>>>
>>> On 21 February 2010 03:14, Rod Cope <rod.c...@openlogic.com> wrote:
>>>
>>>> I've been loading some large data sets over the last week or so, but
>>>> keep running into failures between 4 and 15 hours into the process.
>>>> I've wiped HBase and/or HDFS a few times hoping that would help, but it
>>>> hasn't.
>>>>
>>>> I've implemented all the recommendations for increasing file limits and
>>>> the like on the troubleshooting wiki page.  There's plenty of free disk
>>>> space and memory, with no swap being used on any of the 9 machines in
>>>> the cluster.  All 9 boxes run a managed ZK, regionserver, datanode, and
>>>> MR jobs loading data from HDFS and NFS-mounted disk into HBase.  Doing a
>>>> zk_dump shows an average of 1 for all machines, with the highest max
>>>> being 621.  The regionserver having trouble varies from load to load, so
>>>> the problem doesn't appear to be machine-specific.
>>>>
>>>> You can see in the logs below that a compaction is started which leads
>>>> to a LeaseExpiredException: File does not exist (I've done a hadoop get
>>>> and it's really not there).  Then there's an Error Recovery for a block,
>>>> a compaction/split failure, "Premeture EOF from inputStream", "No live
>>>> nodes contain current block", and finally "Cannot open filename".  At
>>>> this point, there's a meltdown where the vast majority of the rest of
>>>> the log is filled with exceptions like these back to back.  The
>>>> regionserver doesn't go down, however.
>>>>
>>>> I'm on the released HBase 0.20.3 with Hadoop 0.20.2 as of yesterday
>>>> (RC4).  I upgraded Hadoop from 0.20.1 hoping that would help some of the
>>>> problems I've been having, but it only seemed to change the details of
>>>> the exceptions and not the results.  Once I upgraded to Hadoop 0.20.2, I
>>>> replaced HBase's hadoop-0.20.1-hdfs127-core.jar in lib with the new
>>>> hadoop-0.20.2-core.jar.
>>>>
>>>> Any ideas?  I'm really under the gun to get this data loaded, so any
>>>> workarounds or other recommendations are much appreciated.
>>>>
>>>> Thanks,
>>>> Rod
>>>>
>>>> ----
>>>>
>>>> Here's a link to the logs below in case they're not easy to read:
>>>> http://pastebin.com/d7907bca
>>>>
>>>>
>>>> 2010-02-19 21:59:24,950 DEBUG
>>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
>>>> requested for region
>>>> files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk\x7Csrc/svn/n/ne/n
>>>> erdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606/
>>>> 25429292 because: Region has references on open
>>>> 2010-02-19 21:59:24,950 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>>>> Starting compaction on region
>>>> files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk\x7Csrc/svn/n/ne/n
>>>> erdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606
>>>> 2010-02-19 21:59:24,953 DEBUG org.apache.hadoop.hbase.regionserver.Store:
>>>> Started compaction of 4 file(s), hasReferences=true, into
>>>> /hbase/files/compaction.dir/25429292, seqid=2811972
>>>> 2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
>>>> Exception: org.apache.hadoop.ipc.RemoteException:
>>>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
>>>> /hbase/files/compaction.dir/25429292/2021896477663224037 File does not
>>>> exist. [Lease. Holder: DFSClient_-1386101021, pendingcreates: 1]
>>>> at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1332)
>>>> (...rest of stack trace...)
>>>> 2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>>> Recovery for block blk_2006633705539782284_253567 bad datanode[0] nodes ==
>>>> null
>>>> 2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: Could not
>>>> get block locations. Source file
>>>> "/hbase/files/compaction.dir/25429292/2021896477663224037" - Aborting...
>>>> 2010-02-19 21:59:27,997 ERROR
>>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split
>>>> failed for region
>>>> files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk\x7Csrc/svn/n/ne/n
>>>> erdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606
>>>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException:
>>>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
>>>> /hbase/files/compaction.dir/25429292/2021896477663224037 File does not
>>>> exist. [Lease. Holder: DFSClient_-1386101021, pendingcreates: 1]
>>>> at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1332)
>>>> (...rest of stack trace...)
>>>> 2010-02-19 22:00:23,627 DEBUG
>>>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
>>>> Total=624.38275MB (654712760), Free=172.29224MB (180661512), Max=796.675MB
>>>> (835374272), Counts: Blocks=9977, Access=3726192, Hit=2782447, Miss=943745,
>>>> Evictions=67, Evicted=85131, Ratios: Hit Ratio=74.67266917228699%, Miss
>>>> Ratio=25.327330827713013%, Evicted/Run=1270.6119384765625
>>>> 2010-02-19 22:00:41,978 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>>>> obtain block blk_-5162944092610390422_253522 from any node:
>>>> java.io.IOException: No live nodes contain current block
>>>> 2010-02-19 22:00:44,990 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>>>> obtain block blk_-5162944092610390422_253522 from any node:
>>>> java.io.IOException: No live nodes contain current block
>>>> 2010-02-19 22:00:47,994 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read:
>>>> java.io.IOException: Cannot open filename
>>>> /hbase/files/929080390/metadata/6217150884710004337
>>>> at
>>>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
>>>> (...rest of stack trace...)
>>>> 2010-02-19 22:00:47,994 ERROR
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: java.io.IOException:
>>>> Premeture EOF from inputStream
>>>> at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
>>>> (...rest of stack trace...)
>>>> 2010-02-19 22:00:47,995 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
>>>> handler 76 on 60020, call get([...@3a73f53,
>>>> row=netbeans|https://olex.openlogic.com/packages/netbeans|src/archive/n/ne/n
>>>> etbeans/5.0/netbeans-5.0-src/apisupport/l10n.list, maxVersions=1,
>>>> timeRange=[0,9223372036854775807), families={(family=metadata,
>>>> columns={updated_at}}) from 192.168.60.106:45445: error:
>>>> java.io.IOException: Premeture EOF from inputStream
>>>> java.io.IOException: Premeture EOF from inputStream
>>>> (...rest of stack trace...)
>>>> 2010-02-19 22:00:49,009 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>>>> obtain block blk_-5162944092610390422_253522 from any node:
>>>> java.io.IOException: No live nodes contain current block
>>>> 2010-02-19 22:00:52,019 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>>>> obtain block blk_-5162944092610390422_253522 from any node:
>>>> java.io.IOException: No live nodes contain current block
>>>> 2010-02-19 22:00:54,514 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
>>>> Flush requested on
>>>> files,python\x7Chttps://olex.openlogic.com/packages/python\x7Csrc/archive/p/
>>>> py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
>>>> 2010-02-19 22:00:54,520 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
>>>> Started memstore flush for region
>>>> files,python\x7Chttps://olex.openlogic.com/packages/python\x7Csrc/archive/p/
>>>> py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429.
>>>> Current region memstore size 64.1m
>>>> 2010-02-19 22:00:54,911 DEBUG org.apache.hadoop.hbase.regionserver.Store:
>>>> Added hdfs://dd01:54310/hbase/files/1086732894/content/9096973985255757264,
>>>> entries=4486, sequenceid=2812095, memsize=29.5m, filesize=10.8m to
>>>> files,python\x7Chttps://olex.openlogic.com/packages/python\x7Csrc/archive/p/
>>>> py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
>>>> 2010-02-19 22:00:54,987 DEBUG org.apache.hadoop.hbase.regionserver.Store:
>>>> Added
>>>> hdfs://dd01:54310/hbase/files/1086732894/metadata/3183633054937023200,
>>>> entries=28453, sequenceid=2812095, memsize=8.2m, filesize=638.5k to
>>>> files,python\x7Chttps://olex.openlogic.com/packages/python\x7Csrc/archive/p/
>>>> py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
>>>> 2010-02-19 22:00:55,022 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read:
>>>> java.io.IOException: Cannot open filename
>>>> /hbase/files/929080390/metadata/6217150884710004337
>>>> at
>>>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
>>>> (...rest of stack trace...)
>>>>
>>
>> --
>>
>> Rod Cope | CTO and Founder
>> rod.c...@openlogic.com
>> Follow me on Twitter @RodCope
>>
>> 720 240 4501 | phone
>> 720 240 4557 | fax
>> 1 888 OpenLogic | toll free
>> www.openlogic.com
>> Follow OpenLogic on Twitter @openlogic
>>

--
Rod Cope | CTO and Founder
rod.c...@openlogic.com
Follow me on Twitter @RodCope

720 240 4501 | phone
720 240 4557 | fax
1 888 OpenLogic | toll free
www.openlogic.com
Follow OpenLogic on Twitter @openlogic