I hope someone can help me out. I'm getting started with Hadoop, have written the first part of my project (a custom InputFormat), and am now using it to test out my cluster setup.
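In case it's relevant, here is roughly the shape of the InputFormat, against the 0.19 org.apache.hadoop.mapred API. This is a simplified sketch, not my actual code: the class name is made up, and I've swapped in the stock LineRecordReader where my real per-page record reader goes, just so the sketch stands on its own.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Illustrative name; the real class parses my bundle format.
public class BundleInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        // Each bundle file is handed whole to a single map task.
        return false;
    }

    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf conf, Reporter reporter)
            throws IOException {
        reporter.setStatus(split.toString());
        // My real reader returns one record per stored web page; the
        // stock LineRecordReader stands in here to keep the sketch
        // self-contained.
        return new LineRecordReader(conf, (FileSplit) split);
    }
}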
I'm running 0.19.0. I have five dual-core Linux workstations with most of a 250GB disk available for playing, and am controlling things from my Mac Pro. (This is not the production cluster; that hasn't been assembled yet. This is just to get the code working and figure out the bumps.) My test data is about 18GB of web pages, and the test app at the moment just counts the number of web pages in each bundle file.

The map tasks run just fine, but when the job gets into the reduce phase, the TaskTrackers all get "lost" to the JobTracker. I can't see why, because the TaskTracker processes are all still running on the slaves. Also, the jobdetails URL starts returning an HTTP 500 error, although other links from that page still work.

I've tried going onto the slaves and manually restarting the tasktrackers with hadoop-daemon.sh, and I've also tried turning on job restarting in the site conf and then running stop-mapred/start-mapred. The trackers start up, try to clean up, and get going again, but then they just get lost again.
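For what it's worth, the counting job itself is about as simple as it gets. Here's a sketch of the mapper (illustrative names, not my actual code); the summing is done by the stock org.apache.hadoop.mapred.lib.LongSumReducer:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (bundle file name, 1) for each page record; LongSumReducer
// then totals the pages per bundle.
public class PageCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);
    private final Text bundle = new Text();

    @Override
    public void configure(JobConf conf) {
        // "map.input.file" is set by the framework to the path of the
        // file this map task is reading.
        bundle.set(conf.get("map.input.file", "unknown"));
    }

    public void map(LongWritable offset, Text page,
                    OutputCollector<Text, LongWritable> out,
                    Reporter reporter) throws IOException {
        out.collect(bundle, ONE);
    }
}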
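The job-restarting option I turned on in hadoop-site.xml is, if I've got the 0.19 property name right, this one:

<property>
  <name>mapred.jobtracker.restart.recover</name>
  <value>true</value>
</property>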
Here's some error output from the master jobtracker:

2009-02-02 13:39:40,904 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200902021252_0002_r_000005_1' from 'tracker_darling:localhost.localdomain/127.0.0.1:58336'
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: attempt_200902021252_0002_m_004592_1 is 796370 ms debug.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching task attempt_200902021252_0002_m_004592_1 timed out.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: attempt_200902021252_0002_m_004582_1 is 794199 ms debug.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching task attempt_200902021252_0002_m_004582_1 timed out.
2009-02-02 13:41:22,271 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_cheyenne:localhost.localdomain/127.0.0.1:52769'; resending the previous 'lost' response
2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_tigris:localhost.localdomain/127.0.0.1:52808'; resending the previous 'lost' response
2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_monocacy:localhost.localdomain/127.0.0.1:54464'; resending the previous 'lost' response
2009-02-02 13:41:22,298 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_129.6.101.41:127.0.0.1/127.0.0.1:58744'; resending the previous 'lost' response
2009-02-02 13:41:22,421 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_rhone:localhost.localdomain/127.0.0.1:45749'; resending the previous 'lost' response
2009-02-02 13:41:22,421 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54311 caught: java.lang.NullPointerException
  at org.apache.hadoop.mapred.MapTask.write(MapTask.java:123)
  at org.apache.hadoop.mapred.LaunchTaskAction.write(LaunchTaskAction.java:48)
  at org.apache.hadoop.mapred.HeartbeatResponse.write(HeartbeatResponse.java:101)
  at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:159)
  at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:907)
2009-02-02 13:41:27,275 WARN org.apache.hadoop.mapred.JobTracker: Status from unknown Tracker : tracker_monocacy:localhost.localdomain/127.0.0.1:54464

And here's some output from a slave:

2009-02-02 13:26:39,440 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 129.6.101.18:50060, dest: 129.6.101.12:37304, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_200902021252_0002_m_000111_0
2009-02-02 13:41:40,165 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call to rogue/129.6.101.41:54311 failed on local exception: null
  at org.apache.hadoop.ipc.Client.call(Client.java:699)
  at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
  at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
  at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1164)
  at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:997)
  at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1678)
  at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2698)
Caused by: java.io.EOFException
  at java.io.DataInputStream.readFully(DataInputStream.java:180)
  at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
  at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
  at org.apache.hadoop.io.UTF8.readChars(UTF8.java:211)
  at org.apache.hadoop.io.UTF8.readString(UTF8.java:203)
  at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:230)
  at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:171)
  at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:219)
  at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:66)
  at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:506)
  at org.apache.hadoop.ipc.Client$Connection.run(Client.java:438)
2009-02-02 13:41:40,165 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'rogue' with reponseId '835
2009-02-02 13:41:40,177 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200902021252_0002_r_000007_0: Task attempt_200902021252_0002_r_000007_0 failed to report status for 1600 seconds. Killing!
2009-02-02 13:41:40,287 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 6 bytes for reduce: 4 from map: attempt_200902021252_0002_m_000118_0 given 6/2 from 24 with (-1, -1)
2009-02-02 13:41:40,333 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 129.6.101.18:50060, dest: 129.6.101.12:39569, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_200902021252_0002_m_000118_0
2009-02-02 13:41:40,337 INFO org.apache.hadoop.mapred.TaskTracker: Process Thread Dump: lost task

...followed by lots of thread stacks. I'm not sure which error output is most helpful for this report. In any case, the logs tell me that something went wrong on the slave, and as a result the master has lost contact with the slave's tasktracker, which is still alive but no longer running my job. Can anyone tell me where to start looking?

Thanks,
Ian