Hi Chris,

One thing we've found helpful with ext3 is examining your I/O scheduler. Make sure it's set to "deadline", not "CFQ". This helps keep nodes from being overloaded; when the "du -sk" runs while a node is already overloaded, things go downhill quickly.
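In case it helps, here's roughly what we do to check and switch it. This is just a sketch; it assumes the data disk is /dev/sda, so repeat it for each of your HDFS disks:

  # sketch only: sda is an example device, substitute your actual data disks
  cat /sys/block/sda/queue/scheduler               # active scheduler is shown in [brackets]
  echo deadline > /sys/block/sda/queue/scheduler   # switch at runtime (as root)
  # to make it persist across reboots, add "elevator=deadline" to the kernel boot parameters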
Brian

On Mar 29, 2011, at 11:44 AM, Chris Curtin wrote:

> We are narrowing this down. The last few times it hung we found a 'du -sk'
> process for each of our HDFS disks as the top users of CPU. They are also
> taking a really long time.
>
> Searching around I find one example of someone reporting a similar issue
> with du -sk, but they tied it to XFS. We are using Ext3.
>
> Anyone have any other ideas, since it appears to be related to the 'du' not
> coming back? Note that running the command directly finishes in a few
> seconds.
>
> Thanks,
>
> Chris
>
> On Wed, Mar 16, 2011 at 9:41 AM, Chris Curtin <[email protected]> wrote:
>
>> Caught something today I missed before:
>>
>> 11/03/16 09:32:49 INFO hdfs.DFSClient: Exception in createBlockOutputStream
>> java.io.IOException: Bad connect ack with firstBadLink 10.120.41.105:50010
>> 11/03/16 09:32:49 INFO hdfs.DFSClient: Abandoning block
>> blk_-517003810449127046_10039793
>> 11/03/16 09:32:49 INFO hdfs.DFSClient: Waiting to find target node:
>> 10.120.41.103:50010
>> 11/03/16 09:34:04 INFO hdfs.DFSClient: Exception in createBlockOutputStream
>> java.net.SocketTimeoutException: 69000 millis timeout while waiting for
>> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
>> local=/10.120.41.85:34323 remote=/10.120.41.105:50010]
>> 11/03/16 09:34:04 INFO hdfs.DFSClient: Abandoning block
>> blk_2153189599588075377_10039793
>> 11/03/16 09:34:04 INFO hdfs.DFSClient: Waiting to find target node:
>> 10.120.41.105:50010
>> 11/03/16 09:34:55 INFO hdfs.DFSClient: Could not complete file
>> /tmp/hadoop/mapred/system/job_201103160851_0014/job.jar retrying...
>>
>> On Wed, Mar 16, 2011 at 9:00 AM, Chris Curtin <[email protected]> wrote:
>>
>>> Thanks. Spent a lot of time looking at logs and found nothing on the reducers
>>> until they start complaining about 'could not complete'.
>>>
>>> Found this in the jobtracker log file:
>>>
>>> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient:
>>> DFSOutputStream ResponseProcessor exception for block
>>> blk_3829493505250917008_9959810java.io.IOException: Bad response 1 for block
>>> blk_3829493505250917008_9959810 from datanode 10.120.41.103:50010
>>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2454)
>>> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>> Recovery for block blk_3829493505250917008_9959810 bad datanode[2]
>>> 10.120.41.103:50010
>>> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>> Recovery for block blk_3829493505250917008_9959810 in pipeline
>>> 10.120.41.105:50010, 10.120.41.102:50010, 10.120.41.103:50010: bad
>>> datanode 10.120.41.103:50010
>>> 2011-03-16 02:38:53,133 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>>> complete file
>>> /var/hadoop/tmp/2_20110316_pmta_pipe_2_20_50351_2503122/_logs/history/hadnn01.atlis1_1299879680612_job_201103111641_0312_deliv_2_20110316_pmta_pipe*2_20110316_%5B%281%2F3%29+...QUEUED_T
>>> retrying...
>>>
>>> Looking at the logs from the various times this happens, the 'from
>>> datanode' in the first message is any of the data nodes (roughly equal in
>>> # of times it fails), so I don't think it is one specific node having
>>> problems.
>>>
>>> Any other ideas?
>>>
>>> Thanks,
>>>
>>> Chris
>>>
>>> On Sun, Mar 13, 2011 at 3:45 AM, icebergs <[email protected]> wrote:
>>>
>>>> You should check the bad reducers' logs carefully. There may be more
>>>> information about it.
>>>>
>>>> 2011/3/10 Chris Curtin <[email protected]>
>>>>
>>>>> Hi,
>>>>>
>>>>> The last couple of days we have been seeing tens of thousands of these
>>>>> errors in the logs:
>>>>>
>>>>> INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file
>>>>> /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_000003_0/4129371_172307245/part-00003
>>>>> retrying...
>>>>>
>>>>> When this is going on, the reducer in question is always the last
>>>>> reducer in a job.
>>>>>
>>>>> Sometimes the reducer recovers. Sometimes Hadoop kills that reducer, runs
>>>>> another and it succeeds. Sometimes Hadoop kills the reducer and the new one
>>>>> also fails, so it gets killed and the cluster goes into a loop of
>>>>> kill/launch/kill.
>>>>>
>>>>> At first we thought it was related to the size of the data being evaluated
>>>>> (4+ GB), but we've seen it several times today on < 100 MB.
>>>>>
>>>>> Searching here or online doesn't show a lot about what this error means
>>>>> and how to fix it.
>>>>>
>>>>> We are running 0.20.2, r911707.
>>>>>
>>>>> Any suggestions?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Chris
