Thanks for the quick help, J-D.  Answers in-line.

> 1. Just to make sure, check out the region server log and grep for
> ulimit as it prints it out when it starts.

Fri Feb 19 13:56:34 MST 2010 Starting regionserver on dd08
ulimit -n 32768

> 
> 2. Did you give HBase more heap? (that won't fix your problem, just making
> sure)

# The maximum amount of heap to use, in MB. Default is 1000.
export HBASE_HEAPSIZE=4000

All boxes have 32GB RAM.

> 
> 3. No need to run 9 ZK quorum members, 3 is fine or even 1 fwiw in your case.

Thanks for the suggestion - I'll go with 3.  I was planning to run 5, but
hadn't gotten around to running a separate ZK quorum (they're managed by
HBase right now).  Would having 9 cause problems?  It doesn't seem to be
making trouble at this point.
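
For what it's worth, here's roughly what I'm thinking of putting in
hbase-site.xml once I cut down to 3 quorum members, with HBase continuing to
manage ZK via HBASE_MANAGES_ZK=true in hbase-env.sh.  The hostnames are just
placeholders - I haven't decided which 3 of the 9 boxes yet:

<!-- hostnames are placeholders; pick 3 of the 9 boxes -->
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>hostA,hostB,hostC</value>
</property>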

> 4. How many CPUs do you have per box? How much tasks are running at
> the same time?

Boxes are each dual quad-core or dual hex-core with 6 hard drives and
they're all dedicated to Hadoop-related work.  The one with the most recent
failure is a dual hex-core.  It was running 7 map tasks (no reduces), ZK,
datanode, tasktracker, regionserver, Stargate, and Tomcat w/2 instances of
Solr.  It's not using any swap (thanks to 32GB RAM).  As far as I can tell,
it was never under too much stress in terms of CPU, memory, or disk - not
sure about network bandwidth, although the 9 boxes are on a dedicated Gbit
switch.

 
> 5. Did you grep the datanodes logs for exceptions?

This is all for the box with the bad regionserver on the latest load:

I found hundreds of these, the first of which appeared about 1 hour into the load:

2010-02-19 20:43:12,386 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(192.168.60.108:50010,
storageID=DS-1345361456-192.168.60.108-50010-1266525372855, infoPort=50075,
ipcPort=50020):Got exception while serving blk_5418714232323545866_193910 to
/192.168.60.101:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for
channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/192.168.60.108:50010
remote=/192.168.60.101:47012]


A handful of these:

2010-02-19 15:28:41,481 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_4631267981694470746_202618 src: /192.168.60.106:52238 dest:
/192.168.60.108:50010
2010-02-19 15:28:41,482 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_4631267981694470746_202618 received exception
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block
blk_4631267981694470746_202618 is valid, and cannot be written to.
2010-02-19 15:28:41,482 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(192.168.60.108:50010,
storageID=DS-1345361456-192.168.60.108-50010-1266525372855, infoPort=50075,
ipcPort=50020):DataXceiver
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block
blk_4631267981694470746_202618 is valid, and cannot be written to.


And 2 of these within 2 seconds of each other, about an hour before the
regionserver started flaking out:

2010-02-19 20:42:25,252 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(192.168.60.108:50010,
storageID=DS-1345361456-192.168.60.108-50010-1266525372855, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.IOException: xceiverCount 1025 exceeds the limit of concurrent
xcievers 1024



Looks like I need to bump up the number of xcievers.  Any thoughts on the
SocketTimeoutExceptions or BlockAlreadyExistsExceptions?
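
For the xcievers, here's the change I'm planning to make in hdfs-site.xml on
each datanode (the 4096 value is a guess on my part, not something I've
verified for our load), followed by a datanode restart:

<!-- value is a guess, not yet verified for this cluster -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>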

Thanks,
Rod

> On Sat, Feb 20, 2010 at 1:05 PM, Rod Cope <rod.c...@openlogic.com> wrote:
>> Just did this on every box and got pretty much the same answers on each:
>> 
>> [had...@dd08 logs]$ ulimit -n; /usr/sbin/lsof | wc -l
>> 32768
>> 3589
>> 
>> Any other thoughts?
>> 
>> Rod
>> 
>> On Saturday, February 20, 2010 at 1:58 PM, "Dan Washusen"
>> <d...@reactive.org> wrote:
>> 
>>> Sorry I can't be more helpful, but just to double-check it's not a file
>>> limits issue, could you run the following on each of the hosts:
>>> 
>>> $ ulimit -a
>>> $ lsof | wc -l
>>> 
>>> The first command will show you (among other things) the file limits; it
>>> should be above the default 1024.  The second will tell you how many files
>>> are currently open...
>>> 
>>> Cheers,
>>> Dan
>>> 
>>> On 21 February 2010 03:14, Rod Cope <rod.c...@openlogic.com> wrote:
>>> 
>>>> I've been loading some large data sets over the last week or so, but keep
>>>> running into failures between 4 and 15 hours into the process.  I've wiped
>>>> HBase and/or HDFS a few times hoping that would help, but it hasn't.
>>>> 
>>>> I've implemented all the recommendations for increasing file limits and the
>>>> like on the troubleshooting wiki page.  There's plenty of free disk space
>>>> and memory with no swap being used on any of the 9 machines in the cluster.
>>>> All 9 boxes run a managed ZK, regionserver, datanode, and MR jobs loading
>>>> data from HDFS and NFS-mounted disk into HBase.  Doing a zk_dump shows an
>>>> average of 1 for all machines with the highest max being 621.  The
>>>> regionserver having trouble varies from load to load, so the problem doesn't
>>>> appear to be machine-specific.
>>>> 
>>>> You can see in the logs below that a compaction is started which leads to a
>>>> LeaseExpiredException: File does not exist (I've done a hadoop -get and it's
>>>> really not there).  Then an Error Recovery for a block, compaction/split
>>>> fail, "Premeture EOF from inputStream", "No live nodes contain current
>>>> block", and finally "Cannot open filename".  At this point, there's a
>>>> meltdown where the vast majority of the rest of the log is filled with
>>>> exceptions like these back to back.  The regionserver doesn't go down,
>>>> however.
>>>> 
>>>> I'm on the released HBase 0.20.3 with Hadoop 0.20.2 as of yesterday (RC4).
>>>> I upgraded Hadoop from 0.20.1 hoping that would help some of the problems
>>>> I've been having, but it only seemed to change the details of the exceptions
>>>> and not the results.  Once I upgraded to Hadoop 0.20.2, I replaced HBase's
>>>> hadoop-0.20.1-hdfs127-core.jar in lib with the new hadoop-0.20.2-core.jar.
>>>> 
>>>> Any ideas?  I'm really under the gun to get this data loaded, so any
>>>> workarounds or other recommendations are much appreciated.
>>>> 
>>>> Thanks,
>>>> Rod
>>>> 
>>>> ----
>>>> 
>>>> Here's a link to the logs below in case they're not easy to read:
>>>> http://pastebin.com/d7907bca
>>>> 
>>>> 
>>>> 2010-02-19 21:59:24,950 DEBUG
>>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
>>>> requested for region
>>>> files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk\x7Csrc/svn/n/ne/nerdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606/25429292 because: Region has references on open
>>>> 2010-02-19 21:59:24,950 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>>>> Starting compaction on region
>>>> files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk
>>>> \x7Csrc/svn/n/ne/n
>>>> erdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606
>>>> 2010-02-19 21:59:24,953 DEBUG org.apache.hadoop.hbase.regionserver.Store:
>>>> Started compaction of 4 file(s), hasReferences=true, into
>>>> /hbase/files/compaction.dir/25429292, seqid=2811972
>>>> 2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
>>>> Exception: org.apache.hadoop.ipc.RemoteException:
>>>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
>>>> /hbase/files/compaction.dir/25429292/2021896477663224037 File does not
>>>> exist. [Lease.  Holder: DFSClient_-1386101021, pendingcreates: 1]
>>>>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1332)
>>>>      (...rest of stack trace...)
>>>> 2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>>> Recovery for block blk_2006633705539782284_253567 bad datanode[0] nodes ==
>>>> null
>>>> 2010-02-19 21:59:27,992 WARN org.apache.hadoop.hdfs.DFSClient: Could not
>>>> get
>>>> block locations. Source file
>>>> "/hbase/files/compaction.dir/25429292/2021896477663224037" - Aborting...
>>>> 2010-02-19 21:59:27,997 ERROR
>>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split
>>>> failed for region
>>>> files,nerdpass\x7Chttp://nerdpass.googlecode.com/svn/trunk
>>>> \x7Csrc/svn/n/ne/n
>>>> erdpass/application/library/Zend/Server/Reflection/Method.php,1266641963606
>>>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException:
>>>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
>>>> /hbase/files/compaction.dir/25429292/2021896477663224037 File does not
>>>> exist. [Lease.  Holder: DFSClient_-1386101021, pendingcreates: 1]
>>>>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1332)
>>>>      (...rest of stack trace...)
>>>> 2010-02-19 22:00:23,627 DEBUG
>>>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
>>>> Total=624.38275MB (654712760), Free=172.29224MB (180661512), Max=796.675MB
>>>> (835374272), Counts: Blocks=9977, Access=3726192, Hit=2782447, Miss=943745,
>>>> Evictions=67, Evicted=85131, Ratios: Hit Ratio=74.67266917228699%, Miss
>>>> Ratio=25.327330827713013%, Evicted/Run=1270.6119384765625
>>>> 2010-02-19 22:00:41,978 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>>>> obtain block blk_-5162944092610390422_253522 from any node:
>>>> java.io.IOException: No live nodes contain current block
>>>> 2010-02-19 22:00:44,990 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>>>> obtain block blk_-5162944092610390422_253522 from any node:
>>>> java.io.IOException: No live nodes contain current block
>>>> 2010-02-19 22:00:47,994 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read:
>>>> java.io.IOException: Cannot open filename
>>>> /hbase/files/929080390/metadata/6217150884710004337
>>>>        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
>>>>      (...rest of stack trace...)
>>>> 2010-02-19 22:00:47,994 ERROR
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: java.io.IOException:
>>>> Premeture EOF from inputStream
>>>>        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
>>>>      (...rest of stack trace...)
>>>> 2010-02-19 22:00:47,995 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
>>>> handler 76 on 60020, call get([...@3a73f53,
>>>> row=netbeans|
>>>> https://olex.openlogic.com/packages/netbeans|src/archive/n/ne/n
>>>> etbeans/5.0/netbeans-5.0-src/apisupport/l10n.list, maxVersions=1,
>>>> timeRange=[0,9223372036854775807), families={(family=metadata,
>>>> columns={updated_at}}) from 192.168.60.106:45445: error:
>>>> java.io.IOException: Premeture EOF from inputStream
>>>> java.io.IOException: Premeture EOF from inputStream
>>>>      (...rest of stack trace...)
>>>> 2010-02-19 22:00:49,009 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>>>> obtain block blk_-5162944092610390422_253522 from any node:
>>>> java.io.IOException: No live nodes contain current block
>>>> 2010-02-19 22:00:52,019 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>>>> obtain block blk_-5162944092610390422_253522 from any node:
>>>> java.io.IOException: No live nodes contain current block
>>>> 2010-02-19 22:00:54,514 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
>>>> Flush requested on
>>>> files,python\x7Chttps://olex.openlogic.com/packages/python
>>>> \x7Csrc/archive/p/
>>>> py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
>>>> 2010-02-19 22:00:54,520 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
>>>> Started memstore flush for region
>>>> files,python\x7Chttps://olex.openlogic.com/packages/python
>>>> \x7Csrc/archive/p/
>>>> py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429. Current
>>>> region memstore size 64.1m
>>>> 2010-02-19 22:00:54,911 DEBUG org.apache.hadoop.hbase.regionserver.Store:
>>>> Added hdfs://dd01:54310/hbase/files/1086732894/content/9096973985255757264,
>>>> entries=4486, sequenceid=2812095, memsize=29.5m, filesize=10.8m to
>>>> files,python\x7Chttps://olex.openlogic.com/packages/python
>>>> \x7Csrc/archive/p/
>>>> py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
>>>> 2010-02-19 22:00:54,987 DEBUG org.apache.hadoop.hbase.regionserver.Store:
>>>> Added
>>>> hdfs://dd01:54310/hbase/files/1086732894/metadata/3183633054937023200,
>>>> entries=28453, sequenceid=2812095, memsize=8.2m, filesize=638.5k to
>>>> files,python\x7Chttps://olex.openlogic.com/packages/python
>>>> \x7Csrc/archive/p/
>>>> py/python/2.4.6/python-2.4.6-src/Modules/_csv.c,1266641716429
>>>> 2010-02-19 22:00:55,022 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read:
>>>> java.io.IOException: Cannot open filename
>>>> /hbase/files/929080390/metadata/6217150884710004337
>>>>        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
>>>>      (...rest of stack trace...)
>>>> 
>> 
>> 
>> --
>> 
>> Rod Cope | CTO and Founder
>> rod.c...@openlogic.com
>> Follow me on Twitter @RodCope
>> 
>> 720 240 4501    |  phone
>> 720 240 4557    |  fax
>> 1 888 OpenLogic    |  toll free
>> www.openlogic.com
>> Follow OpenLogic on Twitter @openlogic
>> 
>> 
>> 
>> 
>> 
>> 


-- 

Rod Cope | CTO and Founder
rod.c...@openlogic.com
Follow me on Twitter @RodCope

720 240 4501    |  phone
720 240 4557    |  fax
1 888 OpenLogic    |  toll free
www.openlogic.com 
Follow OpenLogic on Twitter @openlogic




