How long was your job stuck?

The JT should have re-run the map on a different node. Do you see 'fetch 
failures' messages in the JT logs?
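
A quick way to check - sketched here with an assumed log path, so substitute 
wherever your JT actually writes its log:

$ grep -i 'fetch' /var/log/hadoop/hadoop-*-jobtracker-*.log

In 0.20.x the JT generally records the fetch-failure notifications it receives 
from reducers, so repeated matches against a single map attempt usually point 
at the node that served that map's output.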

The upcoming hadoop-0.20.204 release (now under discussion/vote) has better 
logging to help diagnose this in the JT logs.

Arun

On Aug 3, 2011, at 10:30 AM, Kai Ju Liu wrote:

> Hi Arun. A funny thing happened this morning: one of my jobs got stuck with 
> the "fetch failures" messages that you mentioned. There was one map task 
> still pending and one failed map task with that error, and the reducers were 
> stuck at just under 33.3% completion.
> 
> Is there a solution or diagnosis for this situation? I don't know if it's 
> related to the other issue I've been having, but it would be great to resolve 
> this one for now. Thanks!
> 
> Kai Ju
> 
> On Tue, Aug 2, 2011 at 10:18 AM, Kai Ju Liu <ka...@tellapart.com> wrote:
> All of the reducers are complete, both on the job tracker page and the job 
> details page. I used to get "fetch failure" messages when HDFS was mounted on 
> EBS volumes, but I haven't seen any since I migrated to physical disks.
> 
> I'm currently using the fair scheduler, but it doesn't look like I've 
> specified any allocations. Perhaps I'll dig into this further with the 
> Cloudera team to see if there is indeed a problem with the job tracker or 
> scheduler. Otherwise, I'll give 0.20.203 + capacity scheduler a shot.
> 
> Thanks again for the pointers.
> 
> Kai Ju
> 
> 
> On Mon, Aug 1, 2011 at 10:08 PM, Arun C Murthy <a...@hortonworks.com> wrote:
> On Aug 1, 2011, at 9:47 PM, Kai Ju Liu wrote:
> 
>> Hi Arun. Since migrating HDFS off EBS-mounted volumes and onto ephemeral 
>> disks, the problem has actually persisted. Now, however, there is no 
>> evidence of errors on any of the mappers. The job tracker page lists one 
>> fewer completed map than the map total, while the job details page shows 
>> all mappers as having completed. The jobs "hang" in this state as before.
> 
> Are any of your job's reducers completing? Do you see 'fetch failures' 
> messages either in the JT logs or in the reducers' (task) logs?
> 
> If not, it's clear that the JobTracker/scheduler (which scheduler are you 
> using, btw?) is 'losing' tasks, which is a serious bug. You say that you are 
> running CDH - unfortunately I have no idea what patchsets you run with it. I 
> can't, off the top of my head, remember the JT/CapacityScheduler ever losing 
> a task - but the Yahoo clusters I maintained ran hadoop-0.20.203.
> 
> Here is something worth trying: 
> $ grep 'Assigning' JOBTRACKER.log | grep '_<clustertimestamp>_<jobid>_m_'
> 
> JOBTRACKER.log is the JT's log file on the JT host. If your jobid is 
> job_12345342432_0001, then <clustertimestamp> == 12345342432 and <jobid> == 
> 0001.
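> 
> For example, with that jobid the concrete command would look like the sketch 
> below (the log file name is an assumption - use whatever your JT actually 
> writes):
> 
> $ grep 'Assigning' /var/log/hadoop/hadoop-*-jobtracker-*.log | grep '_12345342432_0001_m_'
> 
> Each match should show when, and to which TaskTracker, a map attempt was 
> assigned, so you can tell whether the JT ever re-scheduled the failed map.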
> 
> Good luck.
> 
> Arun
> 
>> 
>> Is there something in particular I should be looking for on my local disks? 
>> Hadoop fsck shows all clear, but I'll have to wait until morning to take 
>> individual nodes offline to check their disks. Any further details you might 
>> have would be very helpful. Thanks!
>> 
>> Kai Ju
>> 
>> On Tue, Jul 19, 2011 at 1:50 PM, Arun C Murthy <a...@hortonworks.com> wrote:
>> Is this reproducible? If so, I'd urge you to check your local disks...
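>> 
>> A rough sketch of what that check might look like on each TaskTracker node - 
>> the mapred.local.dir mount point is taken from the stack trace quoted below, 
>> and the exact paths/devices are assumptions:
>> 
>> $ df -h /mnt/hadoop/mapred/local              # space left under mapred.local.dir
>> $ mount | grep '(ro'                          # any filesystem remounted read-only?
>> $ dmesg | grep -i -E 'i/o error|ext[34]|xfs'  # kernel-level disk complaints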
>> 
>> Arun
>> 
>> On Jul 19, 2011, at 12:41 PM, Kai Ju Liu wrote:
>> 
>>> Hi Marcos. The issue appears to be the following. A reduce task is unable 
>>> to fetch the output of a completed map task. The map task is re-run, but 
>>> the re-run attempt is then unable to find an intermediate spill file it 
>>> needs. Here is the error from the second map attempt:
>>> java.io.FileNotFoundException: /mnt/hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107171642_0560/attempt_201107171642_0560_m_000292_1/output/spill0.out
>>>     at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:176)
>>>     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
>>>     at org.apache.hadoop.mapred.Merger$Segment.init(Merger.java:205)
>>>     at org.apache.hadoop.mapred.Merger$Segment.access$100(Merger.java:165)
>>>     at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:418)
>>>     at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
>>>     at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1547)
>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1179)
>>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
>>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>>     at org.apache.hadoop.mapred.Child.main(Child.java:262)
>>> 
>>> I have been having general difficulties with HDFS on EBS, which pointed me 
>>> in this direction. Does this sound like a possible hypothesis to you? 
>>> Thanks!
>>> 
>>> Kai Ju
>>> 
>>> P.S. I am migrating off of HDFS on EBS, so I will post back with further 
>>> results as soon as I have them.
>>> 
>>> On Thu, Jul 7, 2011 at 6:36 PM, Marcos Ortiz <mlor...@uci.cu> wrote:
>>> 
>>> 
>>> On 7/7/2011 8:43 PM, Kai Ju Liu wrote:
>>> 
>>> Over the past week or two, I've run into an issue where MapReduce jobs
>>> hang or fail near completion. The percent completion of both map and
>>> reduce tasks is often reported as 100%, but the actual number of
>>> completed tasks is less than the total number. It appears that either
>>> tasks backtrack and need to be restarted or the last few reduce tasks
>>> hang interminably on the copy step.
>>> 
>>> In certain cases, the jobs actually complete. In other cases, I can't
>>> wait long enough and have to kill the job manually.
>>> 
>>> My Hadoop cluster is hosted in EC2 on instances of type c1.xlarge with 4
>>> attached EBS volumes. The instances run Ubuntu 10.04.1 with the
>>> 2.6.32-309-ec2 kernel, and I'm currently using Cloudera's CDH3u0
>>> distribution. Has anyone experienced similar behavior in their clusters,
>>> and if so, had any luck resolving it? Thanks!
>>> 
>>> Can you post your NN and DN log files here?
>>> Regards,
>>> 
>>> Kai Ju
>>> 
>>> -- 
>>> Marcos Luís Ortíz Valmaseda
>>>  Software Engineer (UCI)
>>>  Linux User # 418229
>>>  http://marcosluis2186.posterous.com
>>>  http://twitter.com/marcosluis2186
>>> 
>> 
>> 
> 
> 
> 
