Below are some exceptions gleaned from tasktracker logs. I was
wondering whether others are seeing anything similar, and whether anyone
has suggestions on how to avoid or fix them.
First, the general setup: a rack of 35 reasonably fast, lightly loaded
machines running one child per slave, with the head node running the
jobtracker and namenode only. I'm running hadoop revision 383722 --
from yesterday or perhaps the day before -- with nutch as my job.
Most settings are at their defaults (including the ipc timeout).
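For what it's worth, the one-child-per-slave limit is just an override in
each slave's hadoop-site.xml -- assuming the property is still called
mapred.tasktracker.tasks.maximum; the rest is left at defaults:

  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>1</value>
    <description>Run at most one child task per tasktracker.</description>
  </property>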
1. I saw a good few instances of the below during the reduce phase of a job
that runs big files of web pages through the nutch parsers (this is from a
tasktracker log). Reading the code, it looks like the namenode failed to
allocate new blocks after the maximum of 5 attempts, with a 100ms sleep
between each. I'll dig deeper to try and work out why the namenode is
failing, but perhaps others on the list can comment on why this happens?
(A rough sketch of the retry pattern, as I read it, follows the trace.)
060308 053947 task_r_b2avsx java.io.IOException: Cannot obtain
additional block for file
/user/stack/nara/outputs/segments/20060307210958/crawl_fetch/part-00028/.data.crc
060308 053947 task_r_b2avsx at
org.apache.hadoop.ipc.Client.call(Client.java:301)
060308 053947 task_r_b2avsx at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
060308 053947 task_r_b2avsx at
org.apache.hadoop.dfs.$Proxy1.addBlock(Unknown Source)
060308 053947 task_r_b2avsx at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:579)
060308 053947 task_r_b2avsx at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:788)
060308 053947 task_r_b2avsx at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:856)
060308 053947 task_r_b2avsx at
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx at
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx at
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx at
org.apache.hadoop.fs.FSDataOutputStream$Summer.close(FSDataOutputStream.java:78)
060308 053947 task_r_b2avsx at
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx at
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx at
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx at
org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:137)
060308 053947 task_r_b2avsx at
org.apache.hadoop.io.MapFile$Writer.close(MapFile.java:119)
060308 053947 task_r_b2avsx at
org.apache.nutch.fetcher.FetcherOutputFormat$1.close(FetcherOutputFormat.java:86)
060308 053947 task_r_b2avsx at
org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:291)
060308 053947 task_r_b2avsx at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:666)
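To make that concrete, here is a tiny self-contained sketch of the retry
pattern I believe I'm reading -- this is my paraphrase, not Hadoop's actual
DFSClient code, so names and structure are approximate and the path is made up:

  import java.io.IOException;

  public class AddBlockRetrySketch {
      // Stand-in for the namenode addBlock call; here it always fails,
      // the way a namenode that cannot allocate a block would.
      static void addBlock() throws IOException {
          throw new IOException("namenode could not allocate a block");
      }

      // Try up to 5 times, sleeping 100ms between attempts, then give up
      // with the "Cannot obtain additional block" error seen in the log.
      static void allocateWithRetries(String src)
              throws IOException, InterruptedException {
          for (int attempt = 1; attempt <= 5; attempt++) {
              try {
                  addBlock();
                  return;
              } catch (IOException e) {
                  if (attempt == 5) {
                      throw new IOException(
                          "Cannot obtain additional block for file " + src);
                  }
                  Thread.sleep(100);
              }
          }
      }

      public static void main(String[] args) throws Exception {
          allocateWithRetries("/some/example/path");
      }
  }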
2. I saw just a few instances of the below during the crawldb update, but
on a slower rack it happens often. On the slower rack I tried raising the
ipc timeout, but it didn't seem to make a difference (the override I tried
is shown after the log excerpt below). I'd be interested in others'
thoughts on how to avoid this; on the slower rack it actually stops jobs
from completing. (Why would a tasktracker not respond to its child? The
tasktrackers aren't doing much -- or are they?)
060308 114943 task_r_1qxohh 0.9803678% reduce > reduce
060308 115043 task_r_1qxohh Parent died. Exiting task_r_1qxohh
060308 115043 task_r_1qxohh java.io.IOException: timed out waiting for
response
060308 115043 task_r_1qxohh at
org.apache.hadoop.ipc.Client.call(Client.java:303)
060308 115043 task_r_1qxohh at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
060308 115043 task_r_1qxohh at
org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source)
060308 115043 task_r_1qxohh at
org.apache.hadoop.mapred.TaskTracker$Child$1.run(TaskTracker.java:686)
060308 115043 task_r_1qxohh at java.lang.Thread.run(Thread.java:595)
060308 115043 task_r_1qxohh 0.9823446% reduce > reduce
060308 115043 Server connection on port 50050 from 207.241.227.179: exiting
060308 115043 Server connection on port 50050 from 207.241.227.179: exiting
060308 115043 KILLING CHILD PROCESS task_r_1qxohh
060308 115043 task_r_1qxohh Child Error
java.io.IOException: Task process exit with nonzero status.
at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:271)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:145)
(The 'KILLING...' line is logging I added myself.)
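For reference, the timeout change I tried on the slower rack was just an
override in hadoop-site.xml along these lines -- I'm assuming
ipc.client.timeout is the governing property here, and the value shown is
only an example bump, not what I'd necessarily recommend:

  <property>
    <name>ipc.client.timeout</name>
    <value>120000</value>
    <description>Milliseconds before an ipc client call times out.</description>
  </property>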
3. The rack currently looks to be 'hung' on invertlinks. All reduce tasks
show exactly the same progress, roughly 0.25 complete. Nothing has come out
of the jobtracker in the last 4 hours. The namenode log has block reports.
All CPUs are quiescent -- even the jobtracker's. 5 reduce tasks had the
below exception:
060308 132336 task_r_9swl2k Client connection to 207.241.228.28:8009:
starting
060308 132336 task_r_9swl2k parsing
file:/0/hadoop/nara/app/runtime-conf/hadoop-default.xml
060308 132336 task_r_9swl2k parsing
file:/0/hadoop/nara/app/runtime-conf/hadoop-site.xml
060308 132336 Server connection on port 50050 from 207.241.227.176: starting
060308 132336 task_r_9swl2k 0.75% reduce > reduce
060308 132336 task_r_9swl2k Client connection to 0.0.0.0:50050: starting
060308 132339 task_r_9swl2k Error running child
060308 132339 task_r_9swl2k java.lang.RuntimeException: java.io.EOFException
060308 132339 task_r_9swl2k at
org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:132)
060308 132339 task_r_9swl2k at
org.apache.nutch.crawl.LinkDb.reduce(LinkDb.java:108)
060308 132339 task_r_9swl2k at
org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:283)
060308 132339 task_r_9swl2k at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:666)
060308 132339 task_r_9swl2k Caused by: java.io.EOFException
060308 132339 task_r_9swl2k at
java.io.DataInputStream.readFully(DataInputStream.java:178)
060308 132339 task_r_9swl2k at
org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060308 132339 task_r_9swl2k at
org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060308 132339 task_r_9swl2k at
org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
060308 132339 task_r_9swl2k at
org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
060308 132339 task_r_9swl2k at
org.apache.nutch.crawl.Inlink.readFields(Inlink.java:36)
060308 132339 task_r_9swl2k at
org.apache.nutch.crawl.Inlink.read(Inlink.java:53)
060308 132339 task_r_9swl2k at
org.apache.nutch.crawl.Inlinks.readFields(Inlinks.java:44)
060308 132339 task_r_9swl2k at
org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:347)
060308 132339 task_r_9swl2k at
org.apache.hadoop.mapred.ReduceTask$ValuesIterator.getNext(ReduceTask.java:163)
060308 132339 task_r_9swl2k at
org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:129)
060308 132339 task_r_9swl2k ... 3 more
060308 132340 Server connection on port 50050 from 207.241.227.176: exiting
060308 132340 KILLING CHILD PROCESS task_r_9swl2k
A thread dump from the jobtracker shows a bunch of threads in this state:
Full thread dump Java HotSpot(TM) Server VM (1.5.0_06-b05 mixed mode):
"Server connection on port 8010 from xxx.xxx.xxx.xxx" daemon prio=1
tid=0xad324720 nid=0x2074 runnable [0xac07e000..0xac07ee40]
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
- locked <0xb4570d08> (a java.io.BufferedInputStream)
at java.io.DataInputStream.readInt(DataInputStream.java:353)
at org.apache.hadoop.ipc.Server$Connection.run(Server.java:129)
If I connect to the tasktrackers mentioned in the jobtracker thread dump
(the xxx.xxx.xxx.xxx IPs above), no children are running....
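In case it helps anyone poke at this, below is a tiny standalone demo --
nothing Hadoop-specific, and I don't know whether the ipc Server sets a
read timeout on its sockets -- of the state those threads are in: a
blocking readInt() on a socket whose peer has gone quiet but not closed
never returns, so a thread dump shows the reader parked in socketRead0,
just like the 'Server connection' threads above:

  import java.io.DataInputStream;
  import java.net.ServerSocket;
  import java.net.Socket;

  public class BlockedReadDemo {
      public static void main(String[] args) throws Exception {
          ServerSocket server = new ServerSocket(0);
          // A "client" that connects but never writes and never closes,
          // like a child that has simply stopped talking.
          Socket quietClient = new Socket("127.0.0.1", server.getLocalPort());
          Socket conn = server.accept();
          // No conn.setSoTimeout(...) here, so the read below blocks forever.
          DataInputStream in = new DataInputStream(conn.getInputStream());
          System.out.println("blocking in readInt(); take a thread dump now");
          in.readInt();   // never returns; thread sits in socketRead0
      }
  }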
Any pointers appreciated. In the meantime I'll keep digging.
St.Ack