I have more information on the hang below:

Looking at the list of reduce tasks in jobdetails, all seem to be hung waiting on one host. Here is the status message that all 31 of my reduce tasks exhibit:

reduce > copy > [EMAIL PROTECTED]:50040

Looking over on the 309 machine, I see that, as far as its tasktracker is concerned, task_m_5g00f1 completed fine:

060308 112952 parsing file:/0/hadoop/nara/app/runtime-conf/hadoop-site.xml
060308 112952 task_m_5g00f1 1.0% /user/stack/nara/outputs/segments/2006030721095
20:224000000+32000000
060308 112953 Task task_m_5g00f1 is done.
...

...and then it goes on to do cleanup:

060308 122603 task_m_5g00f1 done; removing files.

Twenty-three seconds later (per the timestamps) I start seeing entries like the below in the log, and they are unending:

060308 122626 Server handler 0 on 50040 caught: java.io.FileNotFoundException: /0/hadoop/tmp/task_m_5g00f1/part-19.out
   at org.apache.hadoop.fs.LocalFileSystem.openRaw(LocalFileSystem.java:113)
   at org.apache.hadoop.fs.FSDataInputStream$Checker.<init>(FSDataInputStream.j
   at org.apache.hadoop.fs.FSDataInputStream.<init>(FSDataInputStream.java:228)
   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:154)
   at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:106)
   at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:117)
   at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:215)

It looks like the IPC Server handlers are stuck trying to read map output parts that have since been removed. Things look a little confused.
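A minimal sketch of the race I think is happening: cleanup removes the task's part files while handlers are still being asked to serve them, so every subsequent open throws FileNotFoundException. The class and method names below (`MapOutputGuard`, `openPart`) are hypothetical, not Hadoop APIs; the point is that checking for the file and returning a sentinel would at least stop the unending exception stream, though the real fix is coordination between cleanup and the handlers:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;

// Hypothetical guard around opening a map-output part file that cleanup
// may have already removed. Not Hadoop code -- an illustrative sketch.
public class MapOutputGuard {
    // Returns a stream for the part file, or null if the file is gone.
    // Catching FileNotFoundException also covers the window between the
    // exists() check and the open (cleanup can still win that race).
    static InputStream openPart(File partFile) {
        try {
            return partFile.exists() ? new FileInputStream(partFile) : null;
        } catch (FileNotFoundException e) {
            return null; // cleanup removed it between check and open
        }
    }

    public static void main(String[] args) {
        File gone = new File("/no/such/dir/part-19.out");
        System.out.println(openPart(gone) == null
            ? "part already cleaned up" : "part still present");
    }
}
```

Returning null only silences the symptom; the handler would still need to tell the reduce side to stop retrying that host.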

Will keep digging... but any suggestions as to what might be going on, or things to try, would be appreciated.
St.Ack





Michael Stack wrote:
...

3. The rack currently looks to be 'hung' on invertlinks. All reduce tasks show the exact same ~0.25 complete. Nothing has come out of the jobtracker in the last 4 hours. The namenode log has block reports. All CPUs are quiescent -- even the jobtracker's. 5 reduce tasks had the exception below:

060308 132336 task_r_9swl2k Client connection to 207.241.228.28:8009: starting
060308 132336 task_r_9swl2k parsing file:/0/hadoop/nara/app/runtime-conf/hadoop-default.xml
060308 132336 task_r_9swl2k parsing file:/0/hadoop/nara/app/runtime-conf/hadoop-site.xml
060308 132336 Server connection on port 50050 from 207.241.227.176: starting
060308 132336 task_r_9swl2k 0.75% reduce > reduce
060308 132336 task_r_9swl2k Client connection to 0.0.0.0:50050: starting
060308 132339 task_r_9swl2k Error running child
060308 132339 task_r_9swl2k java.lang.RuntimeException: java.io.EOFException
060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:132)
060308 132339 task_r_9swl2k     at org.apache.nutch.crawl.LinkDb.reduce(LinkDb.java:108)
060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:283)
060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:666)
060308 132339 task_r_9swl2k Caused by: java.io.EOFException
060308 132339 task_r_9swl2k     at java.io.DataInputStream.readFully(DataInputStream.java:178)
060308 132339 task_r_9swl2k     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060308 132339 task_r_9swl2k     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060308 132339 task_r_9swl2k     at org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
060308 132339 task_r_9swl2k     at org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
060308 132339 task_r_9swl2k     at org.apache.nutch.crawl.Inlink.readFields(Inlink.java:36)
060308 132339 task_r_9swl2k     at org.apache.nutch.crawl.Inlink.read(Inlink.java:53)
060308 132339 task_r_9swl2k     at org.apache.nutch.crawl.Inlinks.readFields(Inlinks.java:44)
060308 132339 task_r_9swl2k     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:347)
060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.getNext(ReduceTask.java:163)
060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:129)
060308 132339 task_r_9swl2k     ... 3 more
060308 132340 Server connection on port 50050 from 207.241.227.176: exiting
060308 132340 KILLING CHILD PROCESS task_r_9swl2k
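For what it's worth, the EOFException in that trace is the standard readFully() contract: it throws if the stream ends before the requested length is filled, which is what a truncated or corrupt record looks like to the reader. A tiny self-contained reproduction (illustrative only, using an in-memory stream rather than a SequenceFile):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Minimal reproduction of the failure mode in the trace above:
// DataInputStream.readFully() throws EOFException when the underlying
// stream ends before the declared record length is satisfied.
public class TruncatedRecord {
    // Returns true if a full record of declaredLen bytes could be read.
    static boolean readTruncated(byte[] data, int declaredLen) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[declaredLen];
            in.readFully(buf);   // same call as in the stack trace
            return true;         // record was complete
        } catch (EOFException e) {
            return false;        // record shorter than its declared length
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 4 bytes on hand but the record claims 8 -> EOFException
        System.out.println(readTruncated(new byte[4], 8));
    }
}
```

So the reduce side is reading a record whose declared length exceeds the bytes actually present, i.e. the fetched map output appears truncated.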

A thread dump from job tracker shows a bunch of threads in this state:

Full thread dump Java HotSpot(TM) Server VM (1.5.0_06-b05 mixed mode):
"Server connection on port 8010 from xxx.xxx.xxx.xxx" daemon prio=1 tid=0xad324720 nid=0x2074 runnable [0xac07e000..0xac07ee40] at java.net.SocketInputStream.socketRead0(Native Method)
   at java.net.SocketInputStream.read(SocketInputStream.java:129)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
   - locked <0xb4570d08> (a java.io.BufferedInputStream)
   at java.io.DataInputStream.readInt(DataInputStream.java:353)
   at org.apache.hadoop.ipc.Server$Connection.run(Server.java:129)

If I connect to the mentioned tasktrackers (the ones at the IP xxx.xxx.xxx.xxx addresses from the jobtracker thread dump), no children are running....
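That would be consistent with the connection threads being blocked in readInt() on sockets whose peers have gone away without the connection being torn down: a socket read with no SO_TIMEOUT blocks indefinitely if the peer never writes. A small sketch of the difference a timeout makes (this is generic java.net behavior, not Hadoop's actual IPC configuration):

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Sketch of why those jobtracker handler threads can sit in readInt()
// forever: without SO_TIMEOUT the read blocks until the peer writes or
// the connection dies. With a timeout, the hang becomes a catchable
// SocketTimeoutException the server can use to reap dead connections.
public class ReadWithTimeout {
    // Returns true if the read timed out rather than blocking forever.
    static boolean readsTimeOut(int timeoutMillis) {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket accepted = server.accept()) {
            accepted.setSoTimeout(timeoutMillis); // omit this -> readInt() blocks forever
            new DataInputStream(accepted.getInputStream()).readInt();
            return false;                         // peer unexpectedly sent data
        } catch (SocketTimeoutException e) {
            return true;                          // read gave up instead of hanging
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The "client" side never writes, mimicking a vanished tasktracker child.
        System.out.println(readsTimeOut(200)
            ? "timed out instead of hanging" : "got data");
    }
}
```

If the jobtracker's connection reads have no such timeout, each dead tasktracker child could leave a handler thread parked in socketRead0 exactly as the dump shows.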

Any pointers appreciated.  Meantime will keep digging.

St.Ack
