I'm having my job fail on me just before finishing, and after processing for a few hours. All the relevant log info I can find is below, and the only thing I can see as potentially being a problem is that the available space on one of the devices is pretty low. Additionally, 2 map jobs failed because of malformed input, but I would think that hadoop would just ignore those two jobs when trying to complete the reduce phase. Does anyone recognize anything in these logs that would help me identify the problem?

I had a task fail on me with the following message:

07/09/24 15:12:15 INFO mapred.JobClient: Task Id : task_0001_m_004154_1, Status : FAILED 07/09/24 15:12:20 INFO mapred.JobClient: Task Id : task_0001_m_010939_1, Status : FAILED 07/09/24 15:12:24 INFO mapred.JobClient: Task Id : task_0001_m_004154_2, Status : FAILED 07/09/24 15:12:31 INFO mapred.JobClient: Task Id : task_0001_m_010939_2, Status : FAILED
07/09/24 15:12:35 INFO mapred.JobClient:  map 100% reduce 100%
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java: 604)
        at SearchCount.main(SearchCount.java:168)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.util.ProgramDriver $ProgramDescription.invoke(ProgramDriver.java:69) at org.apache.hadoop.util.ProgramDriver.driver (ProgramDriver.java:140)
        at SearchDriver.main(SearchDriver.java:34)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

The logs for all of the tasktrackers end in

2007-09-24 15:13:34,205 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call getMapCompletionEvents(job_0001, 17180, 50) from 17.231.8.30:63506: output error
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen (SocketChannelImpl.java:125) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java: 294) at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer (SocketChannelOutputStream.java:108) at org.apache.hadoop.ipc.SocketChannelOutputStream.write (SocketChannelOutputStream.java:89) at java.io.BufferedOutputStream.flushBuffer (BufferedOutputStream.java:65) at java.io.BufferedOutputStream.flush (BufferedOutputStream.java:123)
        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:592)

with only the Server handler number differing. Finally, the logs from the jobtracker app for the failed reduce jobs show:

2007-09-24 15:12:36,401 WARN org.apache.hadoop.mapred.ReduceTask: task_0001_r_000000_0 copy failed: task_0001_m_017073_0 from 17.231.8.33 2007-09-24 15:12:36,403 WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Server returned HTTP response code: 500 for URL: http://17.231.8.33:50060/mapOutput?map=task_0001_m_017073_0&reduce=0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream (HttpURLConnection.java:1152) at org.apache.hadoop.mapred.MapOutputLocation.getFile (MapOutputLocation.java:206) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier $MapOutputCopier.copyOutput(ReduceTask.java:680) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier $MapOutputCopier.run(ReduceTask.java:641)

Though they actually continue a few lines after that:

2007-09-24 15:12:37,393 INFO org.apache.hadoop.mapred.ReduceTask: task_0001_r_000000_0 Got 0 new map outputs from tasktracker and 0 map outputs from previous failures 2007-09-24 15:12:37,393 INFO org.apache.hadoop.mapred.ReduceTask: task_0001_r_000000_0 Got 88 known map output location(s); scheduling... 2007-09-24 15:12:37,394 INFO org.apache.hadoop.mapred.ReduceTask: task_0001_r_000000_0 Scheduled 2 of 88 known outputs (0 slow hosts and 86 dup hosts)

Thanks.

Ross Boucher
[EMAIL PROTECTED]

Reply via email to