RE: Failed Task

Devaraj Das Tue, 25 Sep 2007 00:07:16 -0700

> Additionally, 2 map jobs failed because of malformed input, 
> but I would think that hadoop would just ignore those two 
> jobs when trying to complete the reduce phase.


That's not true. Under the default settings, a job fails if even a single
map or reduce task fails. Each task gets 4 attempts by default
(mapred.{map,reduce}.max.attempts), and the job fails as soon as any task
has had 4 unsuccessful attempts.

You can, however, tweak the mapred.max.{map,reduce}.failures.percent values
to tolerate cases where you expect some percentage of maps/reduces to fail.
Note that this config item doesn't appear in hadoop-default.xml (although it
should).
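
To make the rule concrete, here is a small standalone sketch of the decision
as I described it. It is a simplified model, not Hadoop's actual source: the
class and method names are made up for illustration, the job size is
hypothetical, and treating the threshold as a strict "greater than" with a
default of 0 percent is my assumption.

```java
// Simplified model of the failure rule described above; not Hadoop code.
// Class name, method names, and the job size in main() are hypothetical.
public class FailureTolerance {

    // Default attempts per task (mapred.{map,reduce}.max.attempts).
    static final int DEFAULT_MAX_ATTEMPTS = 4;

    // A task counts as failed once it has used up all of its attempts.
    static boolean taskFailed(int unsuccessfulAttempts, int maxAttempts) {
        return unsuccessfulAttempts >= maxAttempts;
    }

    // The job fails when the share of failed tasks exceeds the tolerated
    // percentage (mapred.max.{map,reduce}.failures.percent). A value of 0
    // models the default: a single failed task fails the whole job.
    static boolean jobFails(int totalTasks, int failedTasks, int maxFailuresPercent) {
        // Cross-multiplied integer form of: failed / total > percent / 100
        return (long) failedTasks * 100 > (long) maxFailuresPercent * totalTasks;
    }

    public static void main(String[] args) {
        int totalMaps = 10000; // hypothetical job size
        int failedMaps = 2;    // two maps that exhausted their attempts

        System.out.println(jobFails(totalMaps, failedMaps, 0)); // prints "true"
        System.out.println(jobFails(totalMaps, failedMaps, 1)); // prints "false"
    }
}
```

With the tolerance left at 0, two failed maps are enough to fail the job;
allowing 1 percent lets the same job run to completion without them.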

> -----Original Message-----
> From: Ross Boucher [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, September 25, 2007 4:05 AM
> To: [email protected]
> Subject: Failed Task
> 
> I'm having my job fail on me just before finishing, and after 
> processing for a few hours.  All the relevant log info I can 
> find is below, and the only thing I can see as potentially 
> being a problem is that the available space on one of the 
> devices is pretty low.  Additionally, 2 map jobs failed because 
> of malformed input, but I would think that hadoop would just 
> ignore those two jobs when trying to complete the reduce phase.  
> Does anyone recognize anything in these logs that would help me 
> identify the problem?
> 
> I had a task fail on me with the following message:
> 
> 07/09/24 15:12:15 INFO mapred.JobClient: Task Id : task_0001_m_004154_1, Status : FAILED
> 07/09/24 15:12:20 INFO mapred.JobClient: Task Id : task_0001_m_010939_1, Status : FAILED
> 07/09/24 15:12:24 INFO mapred.JobClient: Task Id : task_0001_m_004154_2, Status : FAILED
> 07/09/24 15:12:31 INFO mapred.JobClient: Task Id : task_0001_m_010939_2, Status : FAILED
> 07/09/24 15:12:35 INFO mapred.JobClient:  map 100% reduce 100%
> java.io.IOException: Job failed!
>          at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>          at SearchCount.main(SearchCount.java:168)
>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>          at java.lang.reflect.Method.invoke(Method.java:585)
>          at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:69)
>          at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:140)
>          at SearchDriver.main(SearchDriver.java:34)
>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>          at java.lang.reflect.Method.invoke(Method.java:585)
>          at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
> 
> The logs for all of the tasktrackers end in
> 
> 2007-09-24 15:13:34,205 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call getMapCompletionEvents(job_0001, 17180, 50) from 17.231.8.30:63506: output error
> java.nio.channels.ClosedChannelException
>          at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:125)
>          at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:294)
>          at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:108)
>          at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:89)
>          at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>          at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>          at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:592)
> 
> with only the Server handler number differing.  Finally, the 
> logs from the jobtracker app for the failed reduce jobs show:
> 
> 2007-09-24 15:12:36,401 WARN org.apache.hadoop.mapred.ReduceTask: task_0001_r_000000_0 copy failed: task_0001_m_017073_0 from 17.231.8.33
> 2007-09-24 15:12:36,403 WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Server returned HTTP response code: 500 for URL: http://17.231.8.33:50060/mapOutput?map=task_0001_m_017073_0&reduce=0
>       at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1152)
>       at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:206)
>       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:680)
>       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:641)
> 
> Though they actually continue a few lines after that:
> 
> 2007-09-24 15:12:37,393 INFO org.apache.hadoop.mapred.ReduceTask: task_0001_r_000000_0 Got 0 new map outputs from tasktracker and 0 map outputs from previous failures
> 2007-09-24 15:12:37,393 INFO org.apache.hadoop.mapred.ReduceTask: task_0001_r_000000_0 Got 88 known map output location(s); scheduling...
> 2007-09-24 15:12:37,394 INFO org.apache.hadoop.mapred.ReduceTask: task_0001_r_000000_0 Scheduled 2 of 88 known outputs (0 slow hosts and 86 dup hosts)
> 
> Thanks.
> 
> Ross Boucher
> [EMAIL PROTECTED]
> 
>
