> Additionally, 2 map jobs failed because of malformed input,
> but I would think that hadoop would just ignore those two
> jobs when trying to complete the reduce phase.
That's not true. Under the default settings, a job will fail even if a
single map/reduce task fails. The number of attempts for a task is set to 4
by default (mapred.{map,reduce}.max.attempts), and the job fails if there
happens to be 4 unsuccessful attempts for any task.
You can, however, tweak the mapred.max.{map,reduce}.failures.percent values
to allow for cases where you expect some percent of maps/reduces to fail.
Note that this config item doesn't appear in hadoop-default.xml (although it
should have).
> -----Original Message-----
> From: Ross Boucher [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, September 25, 2007 4:05 AM
> To: [email protected]
> Subject: Failed Task
>
> I'm having my job fail on me just before finishing, and after
> processing for a few hours. All the relevant log info I can
> find is below, and the only thing I can see as potentially
> being a problem is
> that the available space on one of the devices is pretty low.
> Additionally, 2 map jobs failed because of malformed input,
> but I would think that hadoop would just ignore those two
> jobs when trying to complete the reduce phase. Does anyone
> recognize anything in these logs that would help me identify
> the problem?
>
> I had a task fail on me with the following message:
>
> 07/09/24 15:12:15 INFO mapred.JobClient: Task Id :
> task_0001_m_004154_1, Status : FAILED
> 07/09/24 15:12:20 INFO mapred.JobClient: Task Id :
> task_0001_m_010939_1, Status : FAILED
> 07/09/24 15:12:24 INFO mapred.JobClient: Task Id :
> task_0001_m_004154_2, Status : FAILED
> 07/09/24 15:12:31 INFO mapred.JobClient: Task Id :
> task_0001_m_010939_2, Status : FAILED
> 07/09/24 15:12:35 INFO mapred.JobClient: map 100% reduce 100%
> java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:
> 604)
> at SearchCount.main(SearchCount.java:168)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke
> (NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke
> (DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:585)
> at org.apache.hadoop.util.ProgramDriver
> $ProgramDescription.invoke(ProgramDriver.java:69)
> at org.apache.hadoop.util.ProgramDriver.driver
> (ProgramDriver.java:140)
> at SearchDriver.main(SearchDriver.java:34)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke
> (NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke
> (DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:585)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>
> The logs for all of the tasktrackers end in
>
> 2007-09-24 15:13:34,205 WARN org.apache.hadoop.ipc.Server:
> IPC Server handler 0 on 50050, call
> getMapCompletionEvents(job_0001, 17180, 50) from
> 17.231.8.30:63506: output error
> java.nio.channels.ClosedChannelException
> at sun.nio.ch.SocketChannelImpl.ensureWriteOpen
> (SocketChannelImpl.java:125)
> at
> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:
> 294)
> at
> org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer
> (SocketChannelOutputStream.java:108)
> at org.apache.hadoop.ipc.SocketChannelOutputStream.write
> (SocketChannelOutputStream.java:89)
> at java.io.BufferedOutputStream.flushBuffer
> (BufferedOutputStream.java:65)
> at java.io.BufferedOutputStream.flush
> (BufferedOutputStream.java:123)
> at java.io.DataOutputStream.flush(DataOutputStream.java:106)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:592)
>
> with only the Server handler number differing. Finally, the
> logs from the jobtracker app for the failed reduce jobs show:
>
> 2007-09-24 15:12:36,401 WARN org.apache.hadoop.mapred.ReduceTask:
> task_0001_r_000000_0 copy failed: task_0001_m_017073_0 from
> 17.231.8.33
> 2007-09-24 15:12:36,403 WARN org.apache.hadoop.mapred.ReduceTask:
> java.io.IOException: Server returned HTTP response code: 500
> for URL:
> http://17.231.8.33:50060/mapOutput?map=task_0001_m_017073_0&reduce=0
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream
> (HttpURLConnection.java:1152)
> at org.apache.hadoop.mapred.MapOutputLocation.getFile
> (MapOutputLocation.java:206)
> at org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> $MapOutputCopier.copyOutput(ReduceTask.java:680)
> at org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> $MapOutputCopier.run(ReduceTask.java:641)
>
> Though they actually continue a few lines after that:
>
> 2007-09-24 15:12:37,393 INFO org.apache.hadoop.mapred.ReduceTask:
> task_0001_r_000000_0 Got 0 new map outputs from tasktracker
> and 0 map outputs from previous failures
> 2007-09-24 15:12:37,393 INFO org.apache.hadoop.mapred.ReduceTask:
> task_0001_r_000000_0 Got 88 known map output location(s);
> scheduling...
> 2007-09-24 15:12:37,394 INFO org.apache.hadoop.mapred.ReduceTask:
> task_0001_r_000000_0 Scheduled 2 of 88 known outputs (0 slow
> hosts and 86 dup hosts)
>
> Thanks.
>
> Ross Boucher
> [EMAIL PROTECTED]
>
>