How long was your job stuck? The JT should have re-run the map on a different node. Do you see 'fetch failures' messages in the JT logs?
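A quick way to scan for such messages, as a rough sketch only (the exact wording and the JT log location vary by version and distribution):

    # Loosely match fetch-failure notifications in the JobTracker log.
    # JOBTRACKER.log stands for the JT's log file on the JT host (see the
    # grep suggestion further down in this thread); adjust the path for
    # your install.
    $ grep -i fetch JOBTRACKER.log | grep -i fail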
The upcoming hadoop-0.20.204 release (now under discussion/vote) has better logging to help diagnose this in the JT logs.

Arun

On Aug 3, 2011, at 10:30 AM, Kai Ju Liu wrote:

> Hi Arun. A funny thing happened this morning: one of my jobs got stuck with
> the "fetch failures" messages that you mentioned. There was one pending map
> task remaining and one failed map task with that error, and the reducers
> were stuck at just under 33.3% completion.
>
> Is there a solution or diagnosis for this situation? I don't know if it's
> related to the other issue I've been having, but it would be great to resolve
> this one for now. Thanks!
>
> Kai Ju
>
> On Tue, Aug 2, 2011 at 10:18 AM, Kai Ju Liu <ka...@tellapart.com> wrote:
> All of the reducers are complete, both on the job tracker page and the job
> details page. I used to get "fetch failure" messages when HDFS was mounted on
> EBS volumes, but I haven't seen any since I migrated to physical disks.
>
> I'm currently using the fair scheduler, but it doesn't look like I've
> specified any allocations. Perhaps I'll dig into this further with the
> Cloudera team to see if there is indeed a problem with the job tracker or
> scheduler. Otherwise, I'll give 0.20.203 + the capacity scheduler a shot.
>
> Thanks again for the pointers.
>
> Kai Ju
>
> On Mon, Aug 1, 2011 at 10:08 PM, Arun C Murthy <a...@hortonworks.com> wrote:
> On Aug 1, 2011, at 9:47 PM, Kai Ju Liu wrote:
>
>> Hi Arun. Since migrating HDFS off EBS-mounted volumes and onto ephemeral
>> disks, the problem has actually persisted. Now, however, there is no
>> evidence of errors on any of the mappers. The job tracker lists one fewer
>> map completed than the map total, while the job details show all mappers
>> as having completed. The jobs "hang" in this state as before.
>
> Are any of your job's reducers completing? Do you see 'fetch failures'
> messages either in the JT logs or the reducers' (task) logs?
>
> If not, it's clear that the JobTracker/Scheduler (which scheduler are you
> using, btw?) is 'losing' tasks, which is a serious bug. You say that you are
> running CDH; unfortunately I have no idea what patchsets you run with it. I
> can't, off the top of my head, remember the JT/CapacityScheduler losing a
> task, but I maintained Yahoo clusters which ran hadoop-0.20.203.
>
> Here is something worth trying:
>
> $ cat JOBTRACKER.log | grep Assigning | grep _<clustertimestamp>_<jobid>_m_*
>
> JOBTRACKER.log is the JT's log file on the JT host, and if your jobid is
> job_12345342432_0001, then <clustertimestamp> == 12345342432 and
> <jobid> == 0001.
>
> Good luck.
>
> Arun
>
>> Is there something in particular I should be looking for on my local disks?
>> Hadoop fsck shows all clear, but I'll have to wait until morning to take
>> individual nodes offline to check their disks. Any further details you
>> might have would be very helpful. Thanks!
>>
>> Kai Ju
>>
>> On Tue, Jul 19, 2011 at 1:50 PM, Arun C Murthy <a...@hortonworks.com> wrote:
>> Is this reproducible? If so, I'd urge you to check your local disks...
>>
>> Arun
>>
>> On Jul 19, 2011, at 12:41 PM, Kai Ju Liu wrote:
>>
>>> Hi Marcos. The issue appears to be the following. A reduce task is unable
>>> to fetch results from a map task on HDFS. The map task is re-run, but the
>>> re-run map task is then unable to retrieve information that it needs to run.
>>> Here is the error from the second map task:
>>>
>>> java.io.FileNotFoundException:
>>> /mnt/hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107171642_0560/attempt_201107171642_0560_m_000292_1/output/spill0.out
>>>   at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:176)
>>>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
>>>   at org.apache.hadoop.mapred.Merger$Segment.init(Merger.java:205)
>>>   at org.apache.hadoop.mapred.Merger$Segment.access$100(Merger.java:165)
>>>   at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:418)
>>>   at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
>>>   at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
>>>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1547)
>>>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1179)
>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
>>>   at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>   at javax.security.auth.Subject.doAs(Subject.java:396)
>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>>   at org.apache.hadoop.mapred.Child.main(Child.java:262)
>>>
>>> I have been having general difficulties with HDFS on EBS, which pointed me
>>> in this direction. Does this sound like a plausible hypothesis to you?
>>> Thanks!
>>>
>>> Kai Ju
>>>
>>> P.S. I am migrating off of HDFS on EBS, so I will post back with further
>>> results as soon as I have them.
>>>
>>> On Thu, Jul 7, 2011 at 6:36 PM, Marcos Ortiz <mlor...@uci.cu> wrote:
>>>
>>> On 7/7/2011 8:43 PM, Kai Ju Liu wrote:
>>>
>>> Over the past week or two, I've run into an issue where MapReduce jobs
>>> hang or fail near completion. The percent completion of both map and
>>> reduce tasks is often reported as 100%, but the actual number of
>>> completed tasks is less than the total number. It appears that either
>>> tasks backtrack and need to be restarted or the last few reduce tasks
>>> hang interminably on the copy step.
>>>
>>> In certain cases, the jobs actually complete. In other cases, I can't
>>> wait long enough and have to kill the job manually.
>>>
>>> My Hadoop cluster is hosted in EC2 on instances of type c1.xlarge with 4
>>> attached EBS volumes. The instances run Ubuntu 10.04.1 with the
>>> 2.6.32-309-ec2 kernel, and I'm currently using Cloudera's CDH3u0
>>> distribution. Has anyone experienced similar behavior in their clusters,
>>> and if so, had any luck resolving it? Thanks!
>>>
>>> Can you post your NN and DN log files here?
>>> Regards
>>>
>>> Kai Ju
>>>
>>> --
>>> Marcos Luís Ortíz Valmaseda
>>> Software Engineer (UCI)
>>> Linux User # 418229
>>> http://marcosluis2186.posterous.com
>>> http://twitter.com/marcosluis2186
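A side note on the FileNotFoundException quoted above: the spill0.out path lives under mapred.local.dir on the TaskTracker's local disks rather than in HDFS, which is why the earlier suggestion to check the local disks is the relevant one. A minimal sanity-check sketch, assuming /mnt/hadoop/mapred/local (taken from the stack trace) is one of your mapred.local.dir entries; substitute your own directories:

    # For each local mapred directory, report free space and verify that
    # writes still succeed, then look for kernel-level I/O errors.
    for d in /mnt/hadoop/mapred/local; do   # add your other mapred.local.dir entries
      echo "== $d =="
      df -h "$d"
      touch "$d/.disk_check" && rm -f "$d/.disk_check" && echo "write OK" || echo "WRITE FAILED"
    done
    dmesg | grep -i 'i/o error' | tail -20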