I've been loading some large data sets over the last week or so, but keep
running into failures between 4 and 15 hours into the process.  I've wiped
HBase and/or HDFS a few times hoping that would help, but it hasn't.

I've implemented all the recommendations for increasing file limits and the
like on the troubleshooting wiki page.  There's plenty of free disk space
and memory, with no swap being used on any of the 9 machines in the cluster.
All 9 boxes run a managed ZK, regionserver, datanode, and MR jobs loading
data from HDFS and NFS-mounted disk into HBase.  Running zk_dump shows an
average of 1 for all machines, with the highest max being 621.  The
regionserver having trouble varies from load to load, so the problem doesn't
appear to be machine-specific.
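In case it's useful, the limit-related checks and settings I applied look
roughly like this (the specific values here are illustrative, not
necessarily what I'm running):

```shell
# Show the open-file limit for the current user; the HBase troubleshooting
# page recommends raising it well beyond the common 1024 default.
ulimit -n

# Raised via /etc/security/limits.conf for the hadoop user, e.g.:
#   hadoop  -  nofile  32768
#
# and dfs.datanode.max.xcievers (note the project's spelling) bumped in
# hdfs-site.xml on every datanode, e.g.:
#   <property>
#     <name>dfs.datanode.max.xcievers</name>
#     <value>4096</value>
#   </property>
```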

You can see in the logs below that a compaction is started which leads to a
LeaseExpiredException: File does not exist (I've done a hadoop -get and it's
really not there).  Then an Error Recovery for a block, compaction/split
fail, "Premeture EOF from inputStream", "No live nodes contain current
block", and finally "Cannot open filename".  At this point, there's a
meltdown where the vast majority of the rest of the log is filled with
exceptions like these back to back.  The regionserver doesn't go down,
however.
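For what it's worth, this is roughly how I've been checking whether the
files and blocks are really gone (the path is taken from the log below;
these are cluster commands, so treat them as a sketch):

```shell
# Confirm the compaction output the lease error complains about is gone:
hadoop fs -ls /hbase/files/compaction.dir/25429292/

# Ask the namenode which files are still open for write, and report block
# health and locations under the HBase files table:
hadoop fsck /hbase/files -openforwrite -files -blocks -locations
```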

I'm on the released HBase 0.20.3 with Hadoop 0.20.2 as of yesterday (RC4).
I upgraded Hadoop from 0.20.1 hoping that would help some of the problems
I've been having, but it only seemed to change the details of the exceptions
and not the results.  Once I upgraded to Hadoop 0.20.2, I replaced HBase's
hadoop-0.20.1-hdfs127-core.jar in lib with the new hadoop-0.20.2-core.jar.
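For completeness, the jar swap looked roughly like this on each node (the
install paths below are illustrative -- yours will differ), followed by a
restart of HBase everywhere:

```shell
HBASE_HOME=/opt/hbase            # illustrative install locations
HADOOP_HOME=/opt/hadoop

# Remove the bundled 0.20.1 client jar and drop in the cluster's 0.20.2
# jar so HBase and HDFS speak the same RPC version.
rm "$HBASE_HOME"/lib/hadoop-0.20.1-hdfs127-core.jar
cp "$HADOOP_HOME"/hadoop-0.20.2-core.jar "$HBASE_HOME"/lib/

# Then restart the HBase daemons so they pick up the new jar.
```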

Any ideas?  I'm really under the gun to get this data loaded, so any
workarounds or other recommendations are much appreciated.

Thanks,
Rod

----

Here's a link to the logs below in case they're not easy to read:
http://pastebin.com/d7907bca


2010-02-19 21:59:24,950 DEBUG
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
requested for region
files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk\x7Csrc/svn/n/ne/n
erdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606/
25429292 because: Region has references on open
2010-02-19 21:59:24,950 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Starting compaction on region
files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk\x7Csrc/svn/n/ne/n
erdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606
2010-02-19 21:59:24,953 DEBUG org.apache.hadoop.hbase.regionserver.Store:
Started compaction of 4 file(s), hasReferences=true, into
/hbase/files/compaction.dir/25429292, seqid=2811972
2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
Exception: org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
/hbase/files/compaction.dir/25429292/2021896477663224037 File does not
exist. [Lease.  Holder: DFSClient_-1386101021, pendingcreates: 1]
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.
java:1332)
      (...rest of stack trace...)
2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2006633705539782284_253567 bad datanode[0] nodes ==
null
2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: Could not get
block locations. Source file
"/hbase/files/compaction.dir/25429292/2021896477663224037" - Aborting...
2010-02-19 21:59:27,997 ERROR
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split
failed for region 
files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk\x7Csrc/svn/n/ne/n
erdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException:
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
/hbase/files/compaction.dir/25429292/2021896477663224037 File does not
exist. [Lease.  Holder: DFSClient_-1386101021, pendingcreates: 1]
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.
java:1332)
      (...rest of stack trace...)
2010-02-19 22:00:23,627 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
Total=624.38275MB (654712760), Free=172.29224MB (180661512), Max=796.675MB
(835374272), Counts: Blocks=9977, Access=3726192, Hit=2782447, Miss=943745,
Evictions=67, Evicted=85131, Ratios: Hit Ratio=74.67266917228699%, Miss
Ratio=25.327330827713013%, Evicted/Run=1270.6119384765625
2010-02-19 22:00:41,978 INFO org.apache.hadoop.hdfs.DFSClient: Could not
obtain block blk_-5162944092610390422_253522 from any node:
java.io.IOException: No live nodes contain current block
2010-02-19 22:00:44,990 INFO org.apache.hadoop.hdfs.DFSClient: Could not
obtain block blk_-5162944092610390422_253522 from any node:
java.io.IOException: No live nodes contain current block
2010-02-19 22:00:47,994 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read:
java.io.IOException: Cannot open filename
/hbase/files/929080390/metadata/6217150884710004337
        at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497
)
      (...rest of stack trace...)
2010-02-19 22:00:47,994 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: java.io.IOException:
Premeture EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
      (...rest of stack trace...)
2010-02-19 22:00:47,995 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 76 on 60020, call get([...@3a73f53,
row=netbeans|https://olex.openlogic.com/packages/netbeans|src/archive/n/ne/n
etbeans/5.0/netbeans-5.0-src/apisupport/l10n.list, maxVersions=1,
timeRange=[0,9223372036854775807), families={(family=metadata,
columns={updated_at}}) from 192.168.60.106:45445: error:
java.io.IOException: Premeture EOF from inputStream
java.io.IOException: Premeture EOF from inputStream
      (...rest of stack trace...)
2010-02-19 22:00:49,009 INFO org.apache.hadoop.hdfs.DFSClient: Could not
obtain block blk_-5162944092610390422_253522 from any node:
java.io.IOException: No live nodes contain current block
2010-02-19 22:00:52,019 INFO org.apache.hadoop.hdfs.DFSClient: Could not
obtain block blk_-5162944092610390422_253522 from any node:
java.io.IOException: No live nodes contain current block
2010-02-19 22:00:54,514 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Flush requested on 
files,python\x7Chttps://olex.openlogic.com/packages/python\x7Csrc/archive/p/
py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
2010-02-19 22:00:54,520 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Started memstore flush for region
files,python\x7Chttps://olex.openlogic.com/packages/python\x7Csrc/archive/p/
py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429. Current
region memstore size 64.1m
2010-02-19 22:00:54,911 DEBUG org.apache.hadoop.hbase.regionserver.Store:
Added hdfs://dd01:54310/hbase/files/1086732894/content/9096973985255757264,
entries=4486, sequenceid=2812095, memsize=29.5m, filesize=10.8m to
files,python\x7Chttps://olex.openlogic.com/packages/python\x7Csrc/archive/p/
py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
2010-02-19 22:00:54,987 DEBUG org.apache.hadoop.hbase.regionserver.Store:
Added hdfs://dd01:54310/hbase/files/1086732894/metadata/3183633054937023200,
entries=28453, sequenceid=2812095, memsize=8.2m, filesize=638.5k to
files,python\x7Chttps://olex.openlogic.com/packages/python\x7Csrc/archive/p/
py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
2010-02-19 22:00:55,022 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read:
java.io.IOException: Cannot open filename
/hbase/files/929080390/metadata/6217150884710004337
        at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497
)
      (...rest of stack trace...)
