Below are some exceptions gleaned from tasktracker logs. I was
wondering whether others are seeing anything similar, and whether anyone
has suggestions on how to avoid or fix them.
First, the general setup: a rack of 35 reasonably fast, lightly loaded
machines running one child per slave, with the head node running the
jobtracker and namenode only. I'm running hadoop revision 383722 --
from yesterday or perhaps the day before -- with nutch as my job.
Most settings are at their defaults (including the ipc timeout).
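For what it's worth, the one-child-per-slave limit is just an override in
each slave's hadoop-site.xml -- assuming the property is still called
mapred.tasktracker.tasks.maximum; the rest is left at defaults:

  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>1</value>
    <description>Run at most one child task per tasktracker.</description>
  </property>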
1. I saw a good few instances of the below during the reduce phase of a job
that runs big files of web pages through the nutch parsers (this is from a
tasktracker log). Reading the code, it looks like the namenode failed to
allocate new blocks after the maximum of 5 attempts, with a 100ms sleep
between each. I'll dig deeper to try and work out why the namenode is
failing, but perhaps others on the list can comment on why this happens?
(A rough sketch of the retry pattern, as I read it, follows the trace.)
060308 053947 task_r_b2avsx java.io.IOException: Cannot obtain
additional block for file
/user/stack/nara/outputs/segments/20060307210958/crawl_fetch/part-00028/.data.crc
060308 053947 task_r_b2avsx at
org.apache.hadoop.ipc.Client.call(Client.java:301)
060308 053947 task_r_b2avsx at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
060308 053947 task_r_b2avsx at
org.apache.hadoop.dfs.$Proxy1.addBlock(Unknown Source)
060308 053947 task_r_b2avsx at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:579)
060308 053947 task_r_b2avsx at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:788)
060308 053947 task_r_b2avsx at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:856)
060308 053947 task_r_b2avsx at
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx at
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx at
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx at
org.apache.hadoop.fs.FSDataOutputStream$Summer.close(FSDataOutputStream.java:78)
060308 053947 task_r_b2avsx at
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx at
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx at
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx at
org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:137)
060308 053947 task_r_b2avsx at
org.apache.hadoop.io.MapFile$Writer.close(MapFile.java:119)
060308 053947 task_r_b2avsx at
org.apache.nutch.fetcher.FetcherOutputFormat$1.close(FetcherOutputFormat.java:86)
060308 053947 task_r_b2avsx at
org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:291)
060308 053947 task_r_b2avsx at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:666)
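To make that concrete, here is a tiny self-contained sketch of the retry
pattern I believe I'm reading -- this is my paraphrase, not Hadoop's actual
DFSClient code, so names and structure are approximate and the path is made up:

  import java.io.IOException;

  public class AddBlockRetrySketch {
      // Stand-in for the namenode addBlock call; here it always fails,
      // the way a namenode that cannot allocate a block would.
      static void addBlock() throws IOException {
          throw new IOException("namenode could not allocate a block");
      }

      // Try up to 5 times, sleeping 100ms between attempts, then give up
      // with the "Cannot obtain additional block" error seen in the log.
      static void allocateWithRetries(String src)
              throws IOException, InterruptedException {
          for (int attempt = 1; attempt <= 5; attempt++) {
              try {
                  addBlock();
                  return;
              } catch (IOException e) {
                  if (attempt == 5) {
                      throw new IOException(
                          "Cannot obtain additional block for file " + src);
                  }
                  Thread.sleep(100);
              }
          }
      }

      public static void main(String[] args) throws Exception {
          allocateWithRetries("/some/example/path");
      }
  }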
2. I saw just a few instances of the below during the crawldb update, but
on a slower rack it happens often. On the slower rack I tried raising the
ipc timeout, but it didn't seem to make a difference (the override I tried
is shown after the log excerpt below). I'd be interested in others'
thoughts on how to avoid this; on the slower rack it actually stops jobs
from completing. (Why would a tasktracker not respond to its child? The
tasktrackers aren't doing much -- or are they?)
060308 114943 task_r_1qxohh 0.9803678% reduce > reduce
060308 115043 task_r_1qxohh Parent died. Exiting task_r_1qxohh
060308 115043 task_r_1qxohh java.io.IOException: timed out waiting for
response
060308 115043 task_r_1qxohh at
org.apache.hadoop.ipc.Client.call(Client.java:303)
060308 115043 task_r_1qxohh at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
060308 115043 task_r_1qxohh at
org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source)
060308 115043 task_r_1qxohh at
org.apache.hadoop.mapred.TaskTracker$Child$1.run(TaskTracker.java:686)
060308 115043 task_r_1qxohh at java.lang.Thread.run(Thread.java:595)
060308 115043 task_r_1qxohh 0.9823446% reduce > reduce
060308 115043 Server connection on port 50050 from 207.241.227.179: exiting
060308 115043 Server connection on port 50050 from 207.241.227.179: exiting
060308 115043 KILLING CHILD PROCESS task_r_1qxohh
060308 115043 task_r_1qxohh Child Error
java.io.IOException: Task process exit with nonzero status.
at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:271)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:145)
(The 'KILLING...' line is logging I added myself.)
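For reference, the timeout change I tried on the slower rack was just an
override in hadoop-site.xml along these lines -- I'm assuming
ipc.client.timeout is the governing property here, and the value shown is
only an example bump, not what I'd necessarily recommend:

  <property>
    <name>ipc.client.timeout</name>
    <value>120000</value>
    <description>Milliseconds before an ipc client call times out.</description>
  </property>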
3. The rack currently looks to be 'hung' on invertlinks. All reduce tasks
show exactly the same progress, roughly 0.25 complete. Nothing has come out
of the jobtracker in the last 4 hours. The namenode log has block reports.
All CPUs are quiescent -- even the jobtracker's. 5 reduce tasks had the
below exception:
060308 132336 task_r_9swl2k Client connection to 207.241.228.28:8009:
starting
060308 132336 task_r_9swl2k parsing
file:/0/hadoop/nara/app/runtime-conf/hadoop-default.xml
060308 132336 task_r_9swl2k parsing
file:/0/hadoop/nara/app/runtime-conf/hadoop-site.xml
060308 132336 Server connection on port 50050 from 207.241.227.176: starting
060308 132336 task_r_9swl2k 0.75% reduce > reduce
060308 132336 task_r_9swl2k Client connection to 0.0.0.0:50050: starting
060308 132339 task_r_9swl2k Error running child
060308 132339 task_r_9swl2k java.lang.RuntimeException: java.io.EOFException
060308 132339 task_r_9swl2k at
org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:132)
060308 132339 task_r_9swl2k at
org.apache.nutch.crawl.LinkDb.reduce(LinkDb.java:108)
060308 132339 task_r_9swl2k at
org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:283)
060308 132339 task_r_9swl2k at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:666)
060308 132339 task_r_9swl2k Caused by: java.io.EOFException
060308 132339 task_r_9swl2k at
java.io.DataInputStream.readFully(DataInputStream.java:178)
060308 132339 task_r_9swl2k at
org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060308 132339 task_r_9swl2k at
org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060308 132339 task_r_9swl2k at
org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
060308 132339 task_r_9swl2k at
org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
060308 132339 task_r_9swl2k at
org.apache.nutch.crawl.Inlink.readFields(Inlink.java:36)
060308 132339 task_r_9swl2k at
org.apache.nutch.crawl.Inlink.read(Inlink.java:53)
060308 132339 task_r_9swl2k at
org.apache.nutch.crawl.Inlinks.readFields(Inlinks.java:44)
060308 132339 task_r_9swl2k at
org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:347)
060308 132339 task_r_9swl2k at
org.apache.hadoop.mapred.ReduceTask$ValuesIterator.getNext(ReduceTask.java:163)
060308 132339 task_r_9swl2k at
org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:129)
060308 132339 task_r_9swl2k ... 3 more
060308 132340 Server connection on port 50050 from 207.241.227.176: exiting
060308 132340 KILLING CHILD PROCESS task_r_9swl2k
A thread dump from the jobtracker shows a bunch of threads in this state:
Full thread dump Java HotSpot(TM) Server VM (1.5.0_06-b05 mixed mode):
"Server connection on port 8010 from xxx.xxx.xxx.xxx" daemon prio=1
tid=0xad324720 nid=0x2074 runnable [0xac07e000..0xac07ee40]
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
- locked <0xb4570d08> (a java.io.BufferedInputStream)
at java.io.DataInputStream.readInt(DataInputStream.java:353)
at org.apache.hadoop.ipc.Server$Connection.run(Server.java:129)
If I connect to the tasktrackers mentioned in the jobtracker thread dump
(the xxx.xxx.xxx.xxx IPs above), no children are running....
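In case it helps anyone poke at this, below is a tiny standalone demo --
nothing Hadoop-specific, and I don't know whether the ipc Server sets a
read timeout on its sockets -- of the state those threads are in: a
blocking readInt() on a socket whose peer has gone quiet but not closed
never returns, so a thread dump shows the reader parked in socketRead0,
just like the 'Server connection' threads above:

  import java.io.DataInputStream;
  import java.net.ServerSocket;
  import java.net.Socket;

  public class BlockedReadDemo {
      public static void main(String[] args) throws Exception {
          ServerSocket server = new ServerSocket(0);
          // A "client" that connects but never writes and never closes,
          // like a child that has simply stopped talking.
          Socket quietClient = new Socket("127.0.0.1", server.getLocalPort());
          Socket conn = server.accept();
          // No conn.setSoTimeout(...) here, so the read below blocks forever.
          DataInputStream in = new DataInputStream(conn.getInputStream());
          System.out.println("blocking in readInt(); take a thread dump now");
          in.readInt();   // never returns; thread sits in socketRead0
      }
  }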
Any pointers appreciated. In the meantime I'll keep digging.
St.Ack