Below are some exceptions gleaned from tasktracker logs. I was wondering if others are seeing similar, and whether anyone has suggestions on how to avoid/fix them.

First, the general setting: a rack of 35 reasonably fast, lightly loaded machines running one child per slave, with the head node running the jobtracker and namenode only. I'm on hadoop revision 383722 -- checked out yesterday or perhaps the day before -- and I'm running nutch as my job.

Most settings are default (including ipc timeout).

1. I saw a good few instances of the exception below (from a tasktracker log) during the reduce phase of a job that runs big files of web pages through the nutch parsers. Reading the code, it looks like the namenode failed to allocate a new block after the client tried the maximum of 5 times, with a 100ms sleep between attempts (my paraphrase of that retry loop follows the trace). I'll dig deeper to work out why the namenode is failing, but perhaps others on the list can comment?

060308 053947 task_r_b2avsx java.io.IOException: Cannot obtain additional block for file /user/stack/nara/outputs/segments/20060307210958/crawl_fetch/part-00028/.data.crc
060308 053947 task_r_b2avsx     at org.apache.hadoop.ipc.Client.call(Client.java:301)
060308 053947 task_r_b2avsx     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
060308 053947 task_r_b2avsx     at org.apache.hadoop.dfs.$Proxy1.addBlock(Unknown Source)
060308 053947 task_r_b2avsx     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:579)
060308 053947 task_r_b2avsx     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:788)
060308 053947 task_r_b2avsx     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:856)
060308 053947 task_r_b2avsx     at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx     at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx     at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx     at org.apache.hadoop.fs.FSDataOutputStream$Summer.close(FSDataOutputStream.java:78)
060308 053947 task_r_b2avsx     at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx     at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx     at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx     at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:137)
060308 053947 task_r_b2avsx     at org.apache.hadoop.io.MapFile$Writer.close(MapFile.java:119)
060308 053947 task_r_b2avsx     at org.apache.nutch.fetcher.FetcherOutputFormat$1.close(FetcherOutputFormat.java:86)
060308 053947 task_r_b2avsx     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:291)
060308 053947 task_r_b2avsx     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:666)
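
For what it's worth, the retry behaviour I'm reading in DFSClient$DFSOutputStream.nextBlockOutputStream boils down to something like the sketch below. This is only my paraphrase -- the class, interface and method names (BlockAllocRetrySketch, BlockAllocator, allocateWithRetries) are mine, not Hadoop's; only the addBlock call and the 5-attempts/100ms numbers come from the code I was reading.

import java.io.IOException;

/**
 * Illustration only: the retry-with-sleep pattern I'm reading in
 * DFSClient$DFSOutputStream.nextBlockOutputStream. Names and structure
 * are my paraphrase, not the actual Hadoop source.
 */
public class BlockAllocRetrySketch {

    /** Stand-in for the namenode RPC the real code invokes ($Proxy1.addBlock). */
    interface BlockAllocator {
        String addBlock(String src) throws IOException;
    }

    static String allocateWithRetries(BlockAllocator namenode, String src)
            throws IOException {
        int retriesLeft = 5;                 // the hard-coded maximum of attempts
        while (true) {
            try {
                return namenode.addBlock(src);
            } catch (IOException e) {
                if (--retriesLeft == 0) {
                    // After the final attempt the caller sees the
                    // "Cannot obtain additional block" IOException above.
                    throw new IOException("Cannot obtain additional block for file " + src);
                }
                try {
                    Thread.sleep(100);       // ~100ms pause between attempts
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }
}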


2. I saw just a few instances of the below during the crawldb update, but on a slower rack it happens often. On the slower rack I tried raising the ipc timeout, but it didn't seem to make a difference; there it actually stops jobs from completing. I'd be interested in others' thoughts on how to avoid this. Why would a tasktracker not respond to its child? The tasktrackers aren't doing much, or are they? (A sketch of what I think the child's ping thread is doing follows the log excerpt.)

060308 114943 task_r_1qxohh 0.9803678% reduce > reduce
060308 115043 task_r_1qxohh  Parent died.  Exiting task_r_1qxohh
060308 115043 task_r_1qxohh java.io.IOException: timed out waiting for response
060308 115043 task_r_1qxohh     at org.apache.hadoop.ipc.Client.call(Client.java:303)
060308 115043 task_r_1qxohh     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
060308 115043 task_r_1qxohh     at org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source)
060308 115043 task_r_1qxohh     at org.apache.hadoop.mapred.TaskTracker$Child$1.run(TaskTracker.java:686)
060308 115043 task_r_1qxohh     at java.lang.Thread.run(Thread.java:595)
060308 115043 task_r_1qxohh 0.9823446% reduce > reduce
060308 115043 Server connection on port 50050 from 207.241.227.179: exiting
060308 115043 Server connection on port 50050 from 207.241.227.179: exiting
060308 115043 KILLING CHILD PROCESS task_r_1qxohh
060308 115043 task_r_1qxohh Child Error
java.io.IOException: Task process exit with nonzero status.
    at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:271)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:145)

(The 'KILLING CHILD PROCESS' line is logging I added myself.)
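
To spell out what I understand is happening in the child: TaskTracker$Child starts a thread that periodically pings its tasktracker over IPC and exits when the ping times out, which is where the 'Parent died. Exiting' line comes from. Below is a rough sketch of that loop as I read it -- my paraphrase only; the class and interface names (ChildPingSketch, Umbilical, startPinger) and the 1-second interval are my own stand-ins, not the actual source.

import java.io.IOException;

/**
 * Sketch of what I understand the child's ping thread to be doing
 * (my paraphrase of TaskTracker$Child$1.run, not the real code).
 */
public class ChildPingSketch {

    /** Stand-in for the IPC proxy back to the parent tasktracker. */
    interface Umbilical {
        void ping(String taskId) throws IOException;
    }

    static Thread startPinger(final Umbilical parent, final String taskId) {
        Thread t = new Thread(new Runnable() {
            public void run() {
                while (true) {
                    try {
                        parent.ping(taskId);     // IPC call; subject to the ipc timeout
                        Thread.sleep(1000);      // interval is my guess, not the real value
                    } catch (IOException e) {
                        // The ping timed out: assume the tasktracker is gone and
                        // kill ourselves -- the "Parent died.  Exiting" line above.
                        System.err.println("Parent died.  Exiting " + taskId);
                        System.exit(1);
                    } catch (InterruptedException ie) {
                        return;
                    }
                }
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }
}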

3. The rack currently looks to be 'hung' on invertlinks. All reduce tasks show the exact same progress, around 0.25 complete. No emissions out of the jobtracker in the last 4 hours. The namenode log has block reports. All CPUs are quiescent -- even the jobtracker's. 5 reduce tasks had the exception below:

060308 132336 task_r_9swl2k Client connection to 207.241.228.28:8009: starting
060308 132336 task_r_9swl2k parsing file:/0/hadoop/nara/app/runtime-conf/hadoop-default.xml
060308 132336 task_r_9swl2k parsing file:/0/hadoop/nara/app/runtime-conf/hadoop-site.xml
060308 132336 Server connection on port 50050 from 207.241.227.176: starting
060308 132336 task_r_9swl2k 0.75% reduce > reduce
060308 132336 task_r_9swl2k Client connection to 0.0.0.0:50050: starting
060308 132339 task_r_9swl2k Error running child
060308 132339 task_r_9swl2k java.lang.RuntimeException: java.io.EOFException
060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:132)
060308 132339 task_r_9swl2k     at org.apache.nutch.crawl.LinkDb.reduce(LinkDb.java:108)
060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:283)
060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:666)
060308 132339 task_r_9swl2k Caused by: java.io.EOFException
060308 132339 task_r_9swl2k     at java.io.DataInputStream.readFully(DataInputStream.java:178)
060308 132339 task_r_9swl2k     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060308 132339 task_r_9swl2k     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060308 132339 task_r_9swl2k     at org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
060308 132339 task_r_9swl2k     at org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
060308 132339 task_r_9swl2k     at org.apache.nutch.crawl.Inlink.readFields(Inlink.java:36)
060308 132339 task_r_9swl2k     at org.apache.nutch.crawl.Inlink.read(Inlink.java:53)
060308 132339 task_r_9swl2k     at org.apache.nutch.crawl.Inlinks.readFields(Inlinks.java:44)
060308 132339 task_r_9swl2k     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:347)
060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.getNext(ReduceTask.java:163)
060308 132339 task_r_9swl2k     at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:129)
060308 132339 task_r_9swl2k     ... 3 more
060308 132340 Server connection on port 50050 from 207.241.227.176: exiting
060308 132340 KILLING CHILD PROCESS task_r_9swl2k
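
The trace reads like the reduce hit end-of-stream partway through deserializing an Inlink, i.e. the sequence data it was fed is truncated or corrupt. That failure mode is easy to reproduce with any Writable-style readFields: DataInputStream's readFully/readUTF throws EOFException as soon as the stream runs out mid-record. Here is a minimal, self-contained illustration -- a hypothetical record layout of my own, not Nutch's Inlink:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

/**
 * Minimal illustration of the failure mode in the trace above: a
 * readFields-style deserializer hitting a truncated stream. This is a
 * hypothetical record, not Nutch's Inlink.
 */
public class TruncatedRecordDemo {

    static void writeRecord(DataOutputStream out, String url, String anchor) throws IOException {
        out.writeUTF(url);
        out.writeUTF(anchor);
    }

    static void readRecord(DataInputStream in) throws IOException {
        // readUTF/readFully throw EOFException if the stream ends mid-record,
        // which is what Inlink.readFields reports above.
        String url = in.readUTF();
        String anchor = in.readUTF();
        System.out.println(url + " -> " + anchor);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeRecord(new DataOutputStream(buf), "http://example.org/", "example");

        byte[] full = buf.toByteArray();
        byte[] truncated = new byte[full.length - 3];   // chop the tail off the record
        System.arraycopy(full, 0, truncated, 0, truncated.length);

        try {
            readRecord(new DataInputStream(new ByteArrayInputStream(truncated)));
        } catch (EOFException e) {
            System.out.println("EOFException, as in the reduce task: " + e);
        }
    }
}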

A thread dump from the jobtracker shows a bunch of threads in this state:

Full thread dump Java HotSpot(TM) Server VM (1.5.0_06-b05 mixed mode):
"Server connection on port 8010 from xxx.xxx.xxx.xxx" daemon prio=1 tid=0xad324720 nid=0x2074 runnable [0xac07e000..0xac07ee40] at java.net.SocketInputStream.socketRead0(Native Method)
   at java.net.SocketInputStream.read(SocketInputStream.java:129)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
   - locked <0xb4570d08> (a java.io.BufferedInputStream)
   at java.io.DataInputStream.readInt(DataInputStream.java:353)
   at org.apache.hadoop.ipc.Server$Connection.run(Server.java:129)
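
If I'm reading Server$Connection.run right, that readInt is the IPC server blocking for the start of the next request, so threads parked there mostly mean open-but-idle connections rather than the cause of the hang. A rough sketch of that read loop follows -- my paraphrase, with the call-id/length-prefixed framing being an assumption on my part rather than the actual wire format:

import java.io.DataInputStream;
import java.io.IOException;
import java.net.Socket;

/**
 * Sketch of the server-side read loop the thread dump shows parked in
 * DataInputStream.readInt -- my paraphrase of Server$Connection.run,
 * with the framing details simplified/assumed.
 */
public class IpcConnectionSketch {

    static void serveConnection(Socket socket) throws IOException {
        DataInputStream in = new DataInputStream(socket.getInputStream());
        while (true) {
            // Blocks here until the client sends its next call; this is the
            // frame the jobtracker threads are sitting in, which is normal
            // for an idle but still-open connection.
            int callId = in.readInt();
            byte[] body = new byte[in.readInt()];   // assumed: length-prefixed payload
            in.readFully(body);
            dispatch(callId, body);
        }
    }

    static void dispatch(int id, byte[] payload) {
        // hand the call off to a handler (elided)
    }
}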

If I connect to the tasktrackers mentioned in the jobtracker thread dump (the xxx.xxx.xxx.xxx IPs), no children are running....

Any pointers appreciated.  Meantime will keep digging.

St.Ack
