Hi Arun. Since migrating HDFS off EBS-mounted volumes and onto ephemeral disks, the problem has actually persisted. Now, however, there is no evidence of errors on any of the mappers. The job tracker lists one fewer completed map than the total number of maps, while the job details show all mappers as having completed. The jobs "hang" in this state as before.
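(One way to double-check that discrepancy outside the JobTracker web UI is the stock "hadoop job" CLI. This is only a minimal sketch assuming CDH3's standard command set; the job ID is a placeholder borrowed from the stack trace quoted further down in this thread.)

    # List running jobs, then dump the status of the one that is stuck.
    hadoop job -list
    hadoop job -status job_201107171642_0560
    # Task completion events show which map attempts actually reported success.
    hadoop job -events job_201107171642_0560 0 500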
Is there something in particular I should be looking for on my local disks? Hadoop fsck shows all clear, but I'll have to wait until morning to take individual nodes offline to check their disks. Any further details you might have would be very helpful. Thanks!

Kai Ju

On Tue, Jul 19, 2011 at 1:50 PM, Arun C Murthy <a...@hortonworks.com> wrote:

> Is this reproducible? If so, I'd urge you to check your local disks...
>
> Arun
>
> On Jul 19, 2011, at 12:41 PM, Kai Ju Liu wrote:
>
> Hi Marcos. The issue appears to be the following. A reduce task is unable
> to fetch results from a map task on HDFS. The map task is re-run, but the
> map task is now unable to retrieve information that it needs to run. Here is
> the error from the second map task:
>
> java.io.FileNotFoundException:
> /mnt/hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107171642_0560/attempt_201107171642_0560_m_000292_1/output/spill0.out
>     at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:176)
>     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
>     at org.apache.hadoop.mapred.Merger$Segment.init(Merger.java:205)
>     at org.apache.hadoop.mapred.Merger$Segment.access$100(Merger.java:165)
>     at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:418)
>     at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
>     at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1547)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1179)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>     at org.apache.hadoop.mapred.Child.main(Child.java:262)
>
> I have been having general difficulties with HDFS on EBS, which pointed me in
> this direction. Does this sound like a possible hypothesis to you? Thanks!
>
> Kai Ju
>
> P.S. I am migrating off of HDFS on EBS, so I will post back with further
> results as soon as I have them.
>
> On Thu, Jul 7, 2011 at 6:36 PM, Marcos Ortiz <mlor...@uci.cu> wrote:
>
>> On 7/7/2011 8:43 PM, Kai Ju Liu wrote:
>>
>>> Over the past week or two, I've run into an issue where MapReduce jobs
>>> hang or fail near completion. The percent completion of both map and
>>> reduce tasks is often reported as 100%, but the actual number of
>>> completed tasks is less than the total number. It appears that either
>>> tasks backtrack and need to be restarted or the last few reduce tasks
>>> hang interminably on the copy step.
>>>
>>> In certain cases, the jobs actually complete. In other cases, I can't
>>> wait long enough and have to kill the job manually.
>>>
>>> My Hadoop cluster is hosted in EC2 on instances of type c1.xlarge with 4
>>> attached EBS volumes. The instances run Ubuntu 10.04.1 with the
>>> 2.6.32-309-ec2 kernel, and I'm currently using Cloudera's CDH3u0
>>> distribution. Has anyone experienced similar behavior in their clusters,
>>> and if so, had any luck resolving it? Thanks!
>>>
>> Can you post here your NN and DN logs files?
>> Regards
>>
>>> Kai Ju
>>
>> --
>> Marcos Luís Ortíz Valmaseda
>> Software Engineer (UCI)
>> Linux User # 418229
>> http://marcosluis2186.posterous.com
>> http://twitter.com/marcosluis2186
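(A footnote on the local-disk angle above: the spill file in the quoted stack trace lives under mapred.local.dir on the node's local filesystem, not in HDFS, so "hadoop fsck" can report all clear even while a backing volume is failing. Below is a minimal per-node sketch of the kind of check Arun is suggesting; the directory list is an assumption and should be replaced with the actual mapred.local.dir entries from mapred-site.xml on the TaskTracker nodes.)

    # Probe each local mapred directory: free space, then a quick write/read/delete.
    for d in /mnt/hadoop/mapred/local /mnt2/hadoop/mapred/local; do
        echo "== $d =="
        df -h "$d"
        echo probe > "$d/.disk_probe" && cat "$d/.disk_probe" && rm -f "$d/.disk_probe" \
            || echo "WRITE/READ FAILED on $d"
    done
    # Kernel-level I/O errors on the underlying volumes show up here even when HDFS fsck is clean.
    dmesg | egrep -i 'i/o error|ata.*error|ext[34].*error' | tail -20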