I've been loading some large data sets over the last week or so, but keep running into failures between 4 and 15 hours into the process. I've wiped HBase and/or HDFS a few times hoping that would help, but it hasn't.
I've implemented all the recommendations from the troubleshooting wiki page for increasing file limits and the like (the specific settings are in the P.S. after the logs). There's plenty of free disk space and memory on all 9 machines in the cluster, and no swap is being used. All 9 boxes run a managed ZK, a regionserver, a datanode, and the MR jobs that load data from HDFS and an NFS-mounted disk into HBase. A zk_dump shows an average latency of 1 for all machines, with the highest max being 621. The regionserver having trouble varies from load to load, so the problem doesn't appear to be machine-specific.

You can see in the logs below that a compaction is started, which leads to a LeaseExpiredException: File does not exist (I've done a hadoop get, and it really isn't there). Then comes an Error Recovery for a block, a compaction/split failure, "Premeture EOF from inputStream", "No live nodes contain current block", and finally "Cannot open filename". At that point there's a meltdown, and the vast majority of the rest of the log is filled with exceptions like these back to back. The regionserver doesn't go down, however.

I'm on the released HBase 0.20.3 with Hadoop 0.20.2 as of yesterday (RC4). I upgraded Hadoop from 0.20.1 hoping that would help with some of the problems I've been having, but it only seemed to change the details of the exceptions, not the results. Once I upgraded to Hadoop 0.20.2, I replaced HBase's hadoop-0.20.1-hdfs127-core.jar in lib with the new hadoop-0.20.2-core.jar.

Any ideas? I'm really under the gun to get this data loaded, so any workarounds or other recommendations are much appreciated.

Thanks,
Rod

----

Here's a link to the logs below in case they're not easy to read: http://pastebin.com/d7907bca

2010-02-19 21:59:24,950 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk\x7Csrc/svn/n/ne/nerdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606/25429292 because: Region has references on open
2010-02-19 21:59:24,950 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on region files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk\x7Csrc/svn/n/ne/nerdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606
2010-02-19 21:59:24,953 DEBUG org.apache.hadoop.hbase.regionserver.Store: Started compaction of 4 file(s), hasReferences=true, into /hbase/files/compaction.dir/25429292, seqid=2811972
2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/files/compaction.dir/25429292/2021896477663224037 File does not exist. [Lease. Holder: DFSClient_-1386101021, pendingcreates: 1]
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1332)
    (...rest of stack trace...)
2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2006633705539782284_253567 bad datanode[0] nodes == null
2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/files/compaction.dir/25429292/2021896477663224037" - Aborting...
2010-02-19 21:59:27,997 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk\x7Csrc/svn/n/ne/nerdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/files/compaction.dir/25429292/2021896477663224037 File does not exist. [Lease. Holder: DFSClient_-1386101021, pendingcreates: 1]
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1332)
    (...rest of stack trace...)
2010-02-19 22:00:23,627 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=624.38275MB (654712760), Free=172.29224MB (180661512), Max=796.675MB (835374272), Counts: Blocks=9977, Access=3726192, Hit=2782447, Miss=943745, Evictions=67, Evicted=85131, Ratios: Hit Ratio=74.67266917228699%, Miss Ratio=25.327330827713013%, Evicted/Run=1270.6119384765625
2010-02-19 22:00:41,978 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-5162944092610390422_253522 from any node: java.io.IOException: No live nodes contain current block
2010-02-19 22:00:44,990 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-5162944092610390422_253522 from any node: java.io.IOException: No live nodes contain current block
2010-02-19 22:00:47,994 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: java.io.IOException: Cannot open filename /hbase/files/929080390/metadata/6217150884710004337
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
    (...rest of stack trace...)
2010-02-19 22:00:47,994 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: java.io.IOException: Premeture EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
    (...rest of stack trace...)
2010-02-19 22:00:47,995 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 76 on 60020, call get([...@3a73f53, row=netbeans|https://olex.openlogic.com/packages/netbeans|src/archive/n/ne/netbeans/5.0/netbeans-5.0-src/apisupport/l10n.list, maxVersions=1, timeRange=[0,9223372036854775807), families={(family=metadata, columns={updated_at}}) from 192.168.60.106:45445: error: java.io.IOException: Premeture EOF from inputStream
java.io.IOException: Premeture EOF from inputStream
    (...rest of stack trace...)
2010-02-19 22:00:49,009 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-5162944092610390422_253522 from any node: java.io.IOException: No live nodes contain current block
2010-02-19 22:00:52,019 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-5162944092610390422_253522 from any node: java.io.IOException: No live nodes contain current block
2010-02-19 22:00:54,514 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on files,python\x7Chttps://olex.openlogic.com/packages/python\x7Csrc/archive/p/py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
2010-02-19 22:00:54,520 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for region files,python\x7Chttps://olex.openlogic.com/packages/python\x7Csrc/archive/p/py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429. Current region memstore size 64.1m
2010-02-19 22:00:54,911 DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dd01:54310/hbase/files/1086732894/content/9096973985255757264, entries=4486, sequenceid=2812095, memsize=29.5m, filesize=10.8m to files,python\x7Chttps://olex.openlogic.com/packages/python\x7Csrc/archive/p/py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
2010-02-19 22:00:54,987 DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dd01:54310/hbase/files/1086732894/metadata/3183633054937023200, entries=28453, sequenceid=2812095, memsize=8.2m, filesize=638.5k to files,python\x7Chttps://olex.openlogic.com/packages/python\x7Csrc/archive/p/py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
2010-02-19 22:00:55,022 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: java.io.IOException: Cannot open filename /hbase/files/929080390/metadata/6217150884710004337
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
    (...rest of stack trace...)
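
P.S. For reference, the "file limits and the like" changes from the troubleshooting wiki are the usual two below. The values and the "hadoop" user name are just the commonly recommended examples, not necessarily exactly what I have set:

    # /etc/security/limits.conf -- raise the open-file limit for whatever user
    # runs the HBase/Hadoop daemons ("hadoop" here is a placeholder)
    hadoop  -  nofile  32768

    <!-- hdfs-site.xml on every datanode: raise the DataNode xceiver limit
         (yes, the property name really is spelled "xcievers") -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>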
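
P.P.S. The Hadoop jar swap I mentioned was just the replace-in-lib step, roughly as below ($HBASE_HOME and $HADOOP_HOME are placeholders for wherever HBase and Hadoop are installed):

    # swap the Hadoop core jar bundled with HBase 0.20.3 for the one from the
    # running Hadoop 0.20.2 cluster, then restart HBase
    rm $HBASE_HOME/lib/hadoop-0.20.1-hdfs127-core.jar
    cp $HADOOP_HOME/hadoop-0.20.2-core.jar $HBASE_HOME/lib/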
