My job is failing just before it finishes, after processing for a few hours. All
the relevant log info I can find is below, and the only thing I can see as a
potential problem is that the available space on one of the devices is pretty
low. Additionally, 2 map tasks failed because of malformed input, but I would
have thought Hadoop would just ignore those two tasks when completing the
reduce phase. Does anyone recognize anything in these logs that would help me
identify the problem?
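My (unverified) understanding is that the driver could be told to tolerate a
small fraction of failed maps with something like the sketch below, assuming
JobConf.setMaxMapTaskFailuresPercent exists in the release I'm running; treat
it as a guess rather than something I've tested:

    // Sketch of the driver change I'd expect to make, so a couple of bad
    // map tasks don't fail the whole job. setMaxMapTaskFailuresPercent is
    // from the mapred JobConf API; whether my release has it is an assumption.
    JobConf conf = new JobConf(SearchCount.class);
    conf.setJobName("searchcount");
    // ... set mapper, reducer, input/output formats and paths as usual ...
    conf.setMaxMapTaskFailuresPercent(1);  // tolerate up to 1% failed maps
    JobClient.runJob(conf);

Even so, that wouldn't obviously explain the reduce-phase failures below.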
A couple of task attempts failed, and then the job itself died with the following message:
07/09/24 15:12:15 INFO mapred.JobClient: Task Id : task_0001_m_004154_1, Status : FAILED
07/09/24 15:12:20 INFO mapred.JobClient: Task Id : task_0001_m_010939_1, Status : FAILED
07/09/24 15:12:24 INFO mapred.JobClient: Task Id : task_0001_m_004154_2, Status : FAILED
07/09/24 15:12:31 INFO mapred.JobClient: Task Id : task_0001_m_010939_2, Status : FAILED
07/09/24 15:12:35 INFO mapred.JobClient: map 100% reduce 100%
java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at SearchCount.main(SearchCount.java:168)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:69)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:140)
        at SearchDriver.main(SearchDriver.java:34)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
The logs for all of the tasktrackers end with
2007-09-24 15:13:34,205 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call getMapCompletionEvents(job_0001, 17180, 50) from 17.231.8.30:63506: output error
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:125)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:294)
        at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:108)
        at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:89)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:592)
with only the Server handler number differing. Finally, the jobtracker logs
for the failed reduce tasks show:
2007-09-24 15:12:36,401 WARN org.apache.hadoop.mapred.ReduceTask: task_0001_r_000000_0 copy failed: task_0001_m_017073_0 from 17.231.8.33
2007-09-24 15:12:36,403 WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Server returned HTTP response code: 500 for URL: http://17.231.8.33:50060/mapOutput?map=task_0001_m_017073_0&reduce=0
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1152)
        at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:206)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:680)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:641)
Though they actually continue a few lines after that:
2007-09-24 15:12:37,393 INFO org.apache.hadoop.mapred.ReduceTask: task_0001_r_000000_0 Got 0 new map outputs from tasktracker and 0 map outputs from previous failures
2007-09-24 15:12:37,393 INFO org.apache.hadoop.mapred.ReduceTask: task_0001_r_000000_0 Got 88 known map output location(s); scheduling...
2007-09-24 15:12:37,394 INFO org.apache.hadoop.mapred.ReduceTask: task_0001_r_000000_0 Scheduled 2 of 88 known outputs (0 slow hosts and 86 dup hosts)
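If it helps with diagnosis, I can also have the client print per-attempt
completion events instead of relying on runJob()'s summary. A rough sketch of
what I'd add to the driver, using the mapred client API (I'm assuming the
method names below match the version I'm running, and haven't verified that):

    // Rough sketch: submit the job and poll for task completion events so each
    // failed map/reduce attempt is printed as it happens. RunningJob and
    // TaskCompletionEvent come from the mapred client API.
    JobConf conf = new JobConf(SearchCount.class);
    // ... same job setup as before ...
    JobClient client = new JobClient(conf);
    RunningJob running = client.submitJob(conf);
    int eventCounter = 0;
    while (!running.isComplete()) {
        TaskCompletionEvent[] events = running.getTaskCompletionEvents(eventCounter);
        eventCounter += events.length;
        for (TaskCompletionEvent event : events) {
            System.out.println(event.getTaskId() + " -> " + event.getTaskStatus());
        }
        Thread.sleep(5000);  // assumes the enclosing method declares throws Exception
    }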
Thanks.
Ross Boucher
[EMAIL PROTECTED]