I hope someone can help me out. I'm getting started with Hadoop, have written the first part of my project (a custom InputFormat), and am now using it to test out my cluster setup.
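In case it's relevant, here is roughly the shape of the InputFormat, against the 0.19 org.apache.hadoop.mapred API. This is a simplified sketch, not my actual code: the class name is made up, and I've swapped in the stock LineRecordReader where my real per-page record reader goes, just so the sketch stands on its own.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Illustrative name; the real class parses my bundle format.
public class BundleInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        // Each bundle file is handed whole to a single map task.
        return false;
    }

    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf conf, Reporter reporter)
            throws IOException {
        reporter.setStatus(split.toString());
        // My real reader returns one record per stored web page; the
        // stock LineRecordReader stands in here to keep the sketch
        // self-contained.
        return new LineRecordReader(conf, (FileSplit) split);
    }
}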
I'm running 0.19.0. I have five dual-core Linux workstations with most of a 250GB disk available for playing, and am controlling things from my Mac Pro. (This is not the production cluster; that hasn't been assembled yet. This is just to get the code working and figure out the bumps.) My test data is about 18GB of web pages, and the test app at the moment just counts the number of web pages in each bundle file.

The map tasks run just fine, but when the job gets into the reduce phase, the TaskTrackers all get "lost" to the JobTracker. I can't see why, because the TaskTracker processes are all still running on the slaves. Also, the jobdetails URL starts returning an HTTP 500 error, although other links from that page still work.

I've tried going onto the slaves and manually restarting the tasktrackers with hadoop-daemon.sh, and I've also tried turning on job restarting in the site conf and then running stop-mapred/start-mapred. The trackers start up, try to clean up, and get going again, but then they just get lost again.
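For what it's worth, the counting job itself is about as simple as it gets. Here's a sketch of the mapper (illustrative names, not my actual code); the summing is done by the stock org.apache.hadoop.mapred.lib.LongSumReducer:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (bundle file name, 1) for each page record; LongSumReducer
// then totals the pages per bundle.
public class PageCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);
    private final Text bundle = new Text();

    @Override
    public void configure(JobConf conf) {
        // "map.input.file" is set by the framework to the path of the
        // file this map task is reading.
        bundle.set(conf.get("map.input.file", "unknown"));
    }

    public void map(LongWritable offset, Text page,
                    OutputCollector<Text, LongWritable> out,
                    Reporter reporter) throws IOException {
        out.collect(bundle, ONE);
    }
}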
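The job-restarting option I turned on in hadoop-site.xml is, if I've got the 0.19 property name right, this one:

<property>
  <name>mapred.jobtracker.restart.recover</name>
  <value>true</value>
</property>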
Here's some error output from the master jobtracker:

2009-02-02 13:39:40,904 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200902021252_0002_r_000005_1' from 'tracker_darling:localhost.localdomain/127.0.0.1:58336'
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: attempt_200902021252_0002_m_004592_1 is 796370 ms debug.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching task attempt_200902021252_0002_m_004592_1 timed out.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: attempt_200902021252_0002_m_004582_1 is 794199 ms debug.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching task attempt_200902021252_0002_m_004582_1 timed out.
2009-02-02 13:41:22,271 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_cheyenne:localhost.localdomain/127.0.0.1:52769'; resending the previous 'lost' response
2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_tigris:localhost.localdomain/127.0.0.1:52808'; resending the previous 'lost' response
2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_monocacy:localhost.localdomain/127.0.0.1:54464'; resending the previous 'lost' response
2009-02-02 13:41:22,298 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_129.6.101.41:127.0.0.1/127.0.0.1:58744'; resending the previous 'lost' response
2009-02-02 13:41:22,421 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_rhone:localhost.localdomain/127.0.0.1:45749'; resending the previous 'lost' response
2009-02-02 13:41:22,421 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54311 caught: java.lang.NullPointerException
  at org.apache.hadoop.mapred.MapTask.write(MapTask.java:123)
  at org.apache.hadoop.mapred.LaunchTaskAction.write(LaunchTaskAction.java:48)
  at org.apache.hadoop.mapred.HeartbeatResponse.write(HeartbeatResponse.java:101)
  at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:159)
  at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:907)
2009-02-02 13:41:27,275 WARN org.apache.hadoop.mapred.JobTracker: Status from unknown Tracker : tracker_monocacy:localhost.localdomain/127.0.0.1:54464

And here's some output from a slave:

2009-02-02 13:26:39,440 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 129.6.101.18:50060, dest: 129.6.101.12:37304, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_200902021252_0002_m_000111_0
2009-02-02 13:41:40,165 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call to rogue/129.6.101.41:54311 failed on local exception: null
  at org.apache.hadoop.ipc.Client.call(Client.java:699)
  at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
  at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
  at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1164)
  at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:997)
  at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1678)
  at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2698)
Caused by: java.io.EOFException
  at java.io.DataInputStream.readFully(DataInputStream.java:180)
  at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
  at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
  at org.apache.hadoop.io.UTF8.readChars(UTF8.java:211)
  at org.apache.hadoop.io.UTF8.readString(UTF8.java:203)
  at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:230)
  at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:171)
  at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:219)
  at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:66)
  at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:506)
  at org.apache.hadoop.ipc.Client$Connection.run(Client.java:438)
2009-02-02 13:41:40,165 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'rogue' with reponseId '835
2009-02-02 13:41:40,177 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200902021252_0002_r_000007_0: Task attempt_200902021252_0002_r_000007_0 failed to report status for 1600 seconds. Killing!
2009-02-02 13:41:40,287 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 6 bytes for reduce: 4 from map: attempt_200902021252_0002_m_000118_0 given 6/2 from 24 with (-1, -1)
2009-02-02 13:41:40,333 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 129.6.101.18:50060, dest: 129.6.101.12:39569, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_200902021252_0002_m_000118_0
2009-02-02 13:41:40,337 INFO org.apache.hadoop.mapred.TaskTracker: Process Thread Dump: lost task

...followed by lots of thread stacks. I'm not sure which error output is most helpful for this report. In any case, the logs tell me that something went wrong on the slave, and as a result the master has lost contact with the slave's tasktracker, which is still alive but no longer running my job. Can anyone tell me where to start looking?

Thanks,
Ian