I have more information on the hang below:

Looking at the list of reduce tasks in jobdetails, all seem to be hung waiting on one host. Here is the status message that all 31 of my reduce tasks exhibit:

reduce > copy > [EMAIL PROTECTED]:50040

Looking over on the 309 machine, I see that, as far as its tasktracker is concerned, task_m_5g00f1 completed fine:

060308 112952 parsing file:/0/hadoop/nara/app/runtime-conf/hadoop-site.xml
060308 112952 task_m_5g00f1 1.0% /user/stack/nara/outputs/segments/2006030721095
20:224000000+32000000
060308 112953 Task task_m_5g00f1 is done.
...

...and then it goes on to do cleanup:

060308 122603 task_m_5g00f1 done; removing files.

Twenty-three seconds later (per the timestamps) I start seeing entries like the below in the log, and they are unending:

060308 122626 Server handler 0 on 50040 caught: java.io.FileNotFoundException: /0/hadoop/tmp/task_m_5g00f1/part-19.out
   at org.apache.hadoop.fs.LocalFileSystem.openRaw(LocalFileSystem.java:113)
   at org.apache.hadoop.fs.FSDataInputStream$Checker.<init>(FSDataInputStream.j
   at org.apache.hadoop.fs.FSDataInputStream.<init>(FSDataInputStream.java:228)
   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:154)
   at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:106)
   at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:117)
   at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:215)

It looks like the IPC Server handlers are stuck trying to read map output parts that have since been removed. Things look a little confused.
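A minimal sketch of the race I think is happening: cleanup removes the task's part files while handlers are still being asked to serve them, so every subsequent open throws FileNotFoundException. The class and method names below (`MapOutputGuard`, `openPart`) are hypothetical, not Hadoop APIs; the point is that checking for the file and returning a sentinel would at least stop the unending exception stream, though the real fix is coordination between cleanup and the handlers:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;

// Hypothetical guard around opening a map-output part file that cleanup
// may have already removed. Not Hadoop code -- an illustrative sketch.
public class MapOutputGuard {
    // Returns a stream for the part file, or null if the file is gone.
    // Catching FileNotFoundException also covers the window between the
    // exists() check and the open (cleanup can still win that race).
    static InputStream openPart(File partFile) {
        try {
            return partFile.exists() ? new FileInputStream(partFile) : null;
        } catch (FileNotFoundException e) {
            return null; // cleanup removed it between check and open
        }
    }

    public static void main(String[] args) {
        File gone = new File("/no/such/dir/part-19.out");
        System.out.println(openPart(gone) == null
            ? "part already cleaned up" : "part still present");
    }
}
```

Returning null only silences the symptom; the handler would still need to tell the reduce side to stop retrying that host.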

Will keep digging... but any suggestions as to what might be going on, or things to try, would be appreciated.
St.Ack





Michael Stack wrote:
...

3. The rack currently looks to be 'hung' on invertlinks. All reduce tasks show the exact same ~0.25 complete. Nothing has come out of the jobtracker in the last 4 hours. The namenode log has block reports. All CPUs are quiescent -- even the jobtracker's. 5 reduce tasks had the exception below:

060308 132336 task_r_9swl2k Client connection to 207.241.228.28:8009: starting
060308 132336 task_r_9swl2k parsing file:/0/hadoop/nara/app/runtime-conf/hadoop-default.xml
060308 132336 task_r_9swl2k parsing file:/0/hadoop/nara/app/runtime-conf/hadoop-site.xml
060308 132336 Server connection on port 50050 from 207.241.227.176: starting
060308 132336 task_r_9swl2k 0.75% reduce > reduce
060308 132336 task_r_9swl2k Client connection to 0.0.0.0:50050: starting
060308 132339 task_r_9swl2k Error running child
060308 132339 task_r_9swl2k java.lang.RuntimeException: java.io.EOFException
060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:132)
060308 132339 task_r_9swl2k     at org.apache.nutch.crawl.LinkDb.reduce(LinkDb.java:108)
060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:283)
060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:666)
060308 132339 task_r_9swl2k Caused by: java.io.EOFException
060308 132339 task_r_9swl2k     at java.io.DataInputStream.readFully(DataInputStream.java:178)
060308 132339 task_r_9swl2k     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060308 132339 task_r_9swl2k     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060308 132339 task_r_9swl2k     at org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
060308 132339 task_r_9swl2k     at org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
060308 132339 task_r_9swl2k     at org.apache.nutch.crawl.Inlink.readFields(Inlink.java:36)
060308 132339 task_r_9swl2k     at org.apache.nutch.crawl.Inlink.read(Inlink.java:53)
060308 132339 task_r_9swl2k     at org.apache.nutch.crawl.Inlinks.readFields(Inlinks.java:44)
060308 132339 task_r_9swl2k     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:347)
060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.getNext(ReduceTask.java:163)
060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:129)
060308 132339 task_r_9swl2k     ... 3 more
060308 132340 Server connection on port 50050 from 207.241.227.176: exiting
060308 132340 KILLING CHILD PROCESS task_r_9swl2k
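For what it's worth, the EOFException in that trace is the standard readFully() contract: it throws if the stream ends before the requested length is filled, which is what a truncated or corrupt record looks like to the reader. A tiny self-contained reproduction (illustrative only, using an in-memory stream rather than a SequenceFile):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Minimal reproduction of the failure mode in the trace above:
// DataInputStream.readFully() throws EOFException when the underlying
// stream ends before the declared record length is satisfied.
public class TruncatedRecord {
    // Returns true if a full record of declaredLen bytes could be read.
    static boolean readTruncated(byte[] data, int declaredLen) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[declaredLen];
            in.readFully(buf);   // same call as in the stack trace
            return true;         // record was complete
        } catch (EOFException e) {
            return false;        // record shorter than its declared length
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 4 bytes on hand but the record claims 8 -> EOFException
        System.out.println(readTruncated(new byte[4], 8));
    }
}
```

So the reduce side is reading a record whose declared length exceeds the bytes actually present, i.e. the fetched map output appears truncated.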

A thread dump from job tracker shows a bunch of threads in this state:

Full thread dump Java HotSpot(TM) Server VM (1.5.0_06-b05 mixed mode):
"Server connection on port 8010 from xxx.xxx.xxx.xxx" daemon prio=1 tid=0xad324720 nid=0x2074 runnable [0xac07e000..0xac07ee40] at java.net.SocketInputStream.socketRead0(Native Method)
   at java.net.SocketInputStream.read(SocketInputStream.java:129)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
   - locked <0xb4570d08> (a java.io.BufferedInputStream)
   at java.io.DataInputStream.readInt(DataInputStream.java:353)
   at org.apache.hadoop.ipc.Server$Connection.run(Server.java:129)

If I connect to the mentioned tasktrackers (the ones at the IP xxx.xxx.xxx.xxx addresses from the jobtracker thread dump), no children are running....
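That would be consistent with the connection threads being blocked in readInt() on sockets whose peers have gone away without the connection being torn down: a socket read with no SO_TIMEOUT blocks indefinitely if the peer never writes. A small sketch of the difference a timeout makes (this is generic java.net behavior, not Hadoop's actual IPC configuration):

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Sketch of why those jobtracker handler threads can sit in readInt()
// forever: without SO_TIMEOUT the read blocks until the peer writes or
// the connection dies. With a timeout, the hang becomes a catchable
// SocketTimeoutException the server can use to reap dead connections.
public class ReadWithTimeout {
    // Returns true if the read timed out rather than blocking forever.
    static boolean readsTimeOut(int timeoutMillis) {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket accepted = server.accept()) {
            accepted.setSoTimeout(timeoutMillis); // omit this -> readInt() blocks forever
            new DataInputStream(accepted.getInputStream()).readInt();
            return false;                         // peer unexpectedly sent data
        } catch (SocketTimeoutException e) {
            return true;                          // read gave up instead of hanging
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The "client" side never writes, mimicking a vanished tasktracker child.
        System.out.println(readsTimeOut(200)
            ? "timed out instead of hanging" : "got data");
    }
}
```

If the jobtracker's connection reads have no such timeout, each dead tasktracker child could leave a handler thread parked in socketRead0 exactly as the dump shows.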

Any pointers appreciated.  Meantime will keep digging.

St.Ack
