Just did this on every box and got pretty much the same answers on each:

[had...@dd08 logs]$ ulimit -n; /usr/sbin/lsof | wc -l
32768
3589
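One caveat: ulimit -n in a fresh login shell reports that shell's limit, not necessarily the limit the already-running daemons inherited when they were started. On Linux (2.6.24 or later) you can check the live processes directly; a minimal sketch, assuming the JDK's jps is on the PATH:

  # Print the open-file limit actually in effect for each running Java daemon.
  # Assumes jps (from the JDK) is on the PATH and /proc/<pid>/limits exists.
  for pid in $(jps -q); do
    echo "pid $pid:"
    grep 'Max open files' /proc/$pid/limits
  done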
Any other thoughts?

Rod

On Saturday, February 20, 2010, at 1:58 PM, "Dan Washusen" <d...@reactive.org> wrote:

> Sorry I can't be more helpful, but just to double-check it's not a file
> limits issue, could you run the following on each of the hosts:
>
> $ ulimit -a
> $ lsof | wc -l
>
> The first command will show you (among other things) the file limits; it
> should be above the default 1024. The second will tell you how many files
> are currently open...
>
> Cheers,
> Dan
>
> On 21 February 2010 03:14, Rod Cope <rod.c...@openlogic.com> wrote:
>
>> I've been loading some large data sets over the last week or so, but keep
>> running into failures between 4 and 15 hours into the process. I've wiped
>> HBase and/or HDFS a few times hoping that would help, but it hasn't.
>>
>> I've implemented all the recommendations on the troubleshooting wiki page
>> for increasing file limits and the like. There's plenty of free disk
>> space and memory, with no swap being used on any of the 9 machines in the
>> cluster. All 9 boxes run a managed ZK, regionserver, datanode, and MR
>> jobs loading data from HDFS and NFS-mounted disk into HBase. Doing a
>> zk_dump shows an average latency of 1 for all machines, with the highest
>> max being 621. The regionserver having trouble varies from load to load,
>> so the problem doesn't appear to be machine-specific.
>>
>> You can see in the logs below that a compaction is started, which leads
>> to a LeaseExpiredException: "File does not exist" (I've done a hadoop get
>> and it's really not there). Then come an Error Recovery for a block, a
>> compaction/split failure, "Premeture EOF from inputStream", "No live
>> nodes contain current block", and finally "Cannot open filename". At this
>> point there's a meltdown where the vast majority of the rest of the log
>> is filled with these exceptions back to back. The regionserver doesn't go
>> down, however.
>>
>> I'm on the released HBase 0.20.3 with Hadoop 0.20.2 as of yesterday
>> (RC4). I upgraded Hadoop from 0.20.1 hoping that would help some of the
>> problems I've been having, but it only seemed to change the details of
>> the exceptions, not the results. Once I upgraded to Hadoop 0.20.2, I
>> replaced HBase's hadoop-0.20.1-hdfs127-core.jar in lib with the new
>> hadoop-0.20.2-core.jar.
>>
>> Any ideas? I'm really under the gun to get this data loaded, so any
>> workarounds or other recommendations are much appreciated.
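>> For reference, a leftover copy of the old core jar anywhere on the
>> classpath can cause mismatched-version problems, so the swap is worth
>> double-checking on every node. A sketch of such a check (the hostnames
>> beyond dd01/dd08 and the /opt/hbase path are only illustrative):
>>
>>   # List the Hadoop core jar(s) in HBase's lib/ on each node; there
>>   # should be exactly one, hadoop-0.20.2-core.jar, everywhere.
>>   # NOTE: hostnames dd02-dd07/dd09 and /opt/hbase are assumptions.
>>   for h in dd01 dd02 dd03 dd04 dd05 dd06 dd07 dd08 dd09; do
>>     echo "== $h =="
>>     ssh $h 'ls /opt/hbase/lib/hadoop-*core*.jar'
>>   done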
>>
>> Thanks,
>> Rod
>>
>> ----
>>
>> Here's a link to the logs below in case they're not easy to read:
>> http://pastebin.com/d7907bca
>>
>> 2010-02-19 21:59:24,950 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk\x7Csrc/svn/n/ne/nerdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606/25429292 because: Region has references on open
>> 2010-02-19 21:59:24,950 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on region files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk\x7Csrc/svn/n/ne/nerdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606
>> 2010-02-19 21:59:24,953 DEBUG org.apache.hadoop.hbase.regionserver.Store: Started compaction of 4 file(s), hasReferences=true, into /hbase/files/compaction.dir/25429292, seqid=2811972
>> 2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/files/compaction.dir/25429292/2021896477663224037 File does not exist. [Lease. Holder: DFSClient_-1386101021, pendingcreates: 1]
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1332)
>>         (...rest of stack trace...)
>> 2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2006633705539782284_253567 bad datanode[0] nodes == null
>> 2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/files/compaction.dir/25429292/2021896477663224037" - Aborting...
>> 2010-02-19 21:59:27,997 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk\x7Csrc/svn/n/ne/nerdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606
>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/files/compaction.dir/25429292/2021896477663224037 File does not exist. [Lease. Holder: DFSClient_-1386101021, pendingcreates: 1]
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1332)
>>         (...rest of stack trace...)
>> 2010-02-19 22:00:23,627 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=624.38275MB (654712760), Free=172.29224MB (180661512), Max=796.675MB (835374272), Counts: Blocks=9977, Access=3726192, Hit=2782447, Miss=943745, Evictions=67, Evicted=85131, Ratios: Hit Ratio=74.67266917228699%, Miss Ratio=25.327330827713013%, Evicted/Run=1270.6119384765625
>> 2010-02-19 22:00:41,978 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-5162944092610390422_253522 from any node: java.io.IOException: No live nodes contain current block
>> 2010-02-19 22:00:44,990 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-5162944092610390422_253522 from any node: java.io.IOException: No live nodes contain current block
>> 2010-02-19 22:00:47,994 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: java.io.IOException: Cannot open filename /hbase/files/929080390/metadata/6217150884710004337
>>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
>>         (...rest of stack trace...)
>> 2010-02-19 22:00:47,994 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: java.io.IOException: Premeture EOF from inputStream
>>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
>>         (...rest of stack trace...)
>> 2010-02-19 22:00:47,995 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 76 on 60020, call get([...@3a73f53, row=netbeans|https://olex.openlogic.com/packages/netbeans|src/archive/n/ne/netbeans/5.0/netbeans-5.0-src/apisupport/l10n.list, maxVersions=1, timeRange=[0,9223372036854775807), families={(family=metadata, columns={updated_at}}) from 192.168.60.106:45445: error: java.io.IOException: Premeture EOF from inputStream
>> java.io.IOException: Premeture EOF from inputStream
>>         (...rest of stack trace...)
>> 2010-02-19 22:00:49,009 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-5162944092610390422_253522 from any node: java.io.IOException: No live nodes contain current block
>> 2010-02-19 22:00:52,019 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-5162944092610390422_253522 from any node: java.io.IOException: No live nodes contain current block
>> 2010-02-19 22:00:54,514 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on files,python\x7Chttps://olex.openlogic.com/packages/python\x7Csrc/archive/p/py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
>> 2010-02-19 22:00:54,520 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for region files,python\x7Chttps://olex.openlogic.com/packages/python\x7Csrc/archive/p/py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429. Current region memstore size 64.1m
>> 2010-02-19 22:00:54,911 DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dd01:54310/hbase/files/1086732894/content/9096973985255757264, entries=4486, sequenceid=2812095, memsize=29.5m, filesize=10.8m to files,python\x7Chttps://olex.openlogic.com/packages/python\x7Csrc/archive/p/py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
>> 2010-02-19 22:00:54,987 DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dd01:54310/hbase/files/1086732894/metadata/3183633054937023200, entries=28453, sequenceid=2812095, memsize=8.2m, filesize=638.5k to files,python\x7Chttps://olex.openlogic.com/packages/python\x7Csrc/archive/p/py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
>> 2010-02-19 22:00:55,022 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: java.io.IOException: Cannot open filename /hbase/files/929080390/metadata/6217150884710004337
>>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
>>         (...rest of stack trace...)
>>
>> --
>> Rod Cope | CTO and Founder
>> rod.c...@openlogic.com
>> Follow me on Twitter @RodCope
>> 720 240 4501 | phone
>> 720 240 4557 | fax
>> 1 888 OpenLogic | toll free
>> www.openlogic.com
>> Follow OpenLogic on Twitter @openlogic