Gang,

At the risk of incurring cross-posting ire (and based on a suggestion from Stefan), I'm posting this to nutch-dev as well:

We're now running into "No node available for block <blockID>" errors, which are killing our MapReduce-based crawling jobs. I did some digging through our logs after one of these events to build a context for the problem, so I thought I'd share the results with the rest of you.

Our crawler died during the reduce phase of the crawldb update (the third time around the fetch loop). The ipc.client.timeout value was set to 1,800,000 ms (30 minutes). The job eventually dies because a map task is unable to obtain a block that it needs. However, this block apparently does exist on the DataNode, and for some reason it is successfully transferred 13 minutes after the job is aborted. A number of "Read timed out" and "EOF reading from Socket" errors do appear in the DataNode log before the problem first surfaces, but there's plenty of successful activity as well.
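For reference, the timeout is set in our Nutch config XML roughly like this (just a sketch, with the property shown in isolation; 1,800,000 ms = 30 minutes):

```xml
<!-- Sketch of the relevant property in our site config file
     (surrounding properties omitted). -->
<property>
  <name>ipc.client.timeout</name>
  <value>1800000</value>
  <description>Client-side IPC timeout in milliseconds (30 minutes).</description>
</property>
```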

Afterward, I was able to successfully update the crawldb using the command-line tool (after a stop-all.sh/start-all.sh sequence for good measure), and I restarted the crawler. It ran for a couple of generate/fetch/update loops before it finally died during URL partitioning, again due to "No node available for block <blockID>" errors. These referred to different blocks on a different DataNode slave, but those blocks, too, were apparently present on the DataNode, as evidenced by later log activity. Again, a stop-all.sh/start-all.sh sequence put the crawler back in business, and it's still running now.

Any light that could be shed on the underlying problem(s) would be greatly appreciated.

Thanks,

- Chris

11:45:24pm - The crawl begins:

bin/do_crawl.sh started at 20060204234524...

12:17:17am - A series of read timeout errors appears in the DataNode log on s2. No evidence of these problems appears in the NameNode log.

060205 001717 2457 DataXCeiver
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
        at java.io.FilterInputStream.read(FilterInputStream.java:66)
        at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:293)
        at java.lang.Thread.run(Thread.java:595)

...snip...

060205 001803 2523 DataXCeiver
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
        at java.io.FilterInputStream.read(FilterInputStream.java:66)
        at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:293)
        at java.lang.Thread.run(Thread.java:595)

12:19:56am - An EOF error appears in the DataNode log on s2. No evidence of this problem appears in the NameNode log:

060205 001956 2709 DataXCeiver
java.io.EOFException: EOF reading from Socket[addr=/192.168.1.9,port=42878,localport=50010]
        at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:383)
        at java.lang.Thread.run(Thread.java:595)

12:50:57am - Another series of read timeout errors appears in the DataNode log on s2.

12:53:40am - Another EOF error appears in the DataNode log on s2.

1:27:21am - Another series of read timeout errors appears in the DataNode log on s2. A few EOF errors are sprinkled in as well.

1:29:13am - We see the first mention of blk_1084437997772174917 in the DataNode log on s2. It appears to have received this block from 192.168.1.5 (s4):

060205 012913 4178 Received block blk_1084437997772174917 from /192.168.1.5

1:29:22am - A series of "Block <blockID> is valid, and cannot be written to" errors appears in the DataNode log on s2. None of these refer to blk_1084437997772174917, however:

060205 012922 4205 DataXCeiver
java.io.IOException: Block blk_-4636625784756138174 is valid, and cannot be written to.
        at org.apache.nutch.ndfs.FSDataset.writeToBlock(FSDataset.java:264)
        at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:329)
        at java.lang.Thread.run(Thread.java:595)

...snip...

060205 012923 4212 DataXCeiver
java.io.EOFException: EOF reading from Socket[addr=/192.168.1.4,port=44881,localport=50010]
        at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:383)
        at java.lang.Thread.run(Thread.java:595)
060205 012924 4213 Received block blk_-7873577862877309136 from /192.168.1.2
060205 012925 4214 DataXCeiver
java.io.EOFException: EOF reading from Socket[addr=/192.168.1.5,port=49342,localport=50010]
        at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:383)
        at java.lang.Thread.run(Thread.java:595)
060205 012926 4215 Received block blk_8833881170910103633 from /192.168.1.9 and mirrored to s5/192.168.1.6:50010
060205 012926 4216 Received block blk_-9065065135965140034 from /192.168.1.1
060205 012926 4217 DataXCeiver
java.io.IOException: Block blk_-803324019234272520 is valid, and cannot be written to.
        at org.apache.nutch.ndfs.FSDataset.writeToBlock(FSDataset.java:264)
        at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:329)
        at java.lang.Thread.run(Thread.java:595)
060205 012926 4218 Received block blk_1640893643911715360 from /192.168.1.8 and mirrored to s9/192.168.1.10:50010
060205 012926 4219 Received block blk_8581877778682012452 from /192.168.1.10
060205 012926 4220 Received block blk_5105973706640312551 from /192.168.1.5
060205 012926 4221 DataXCeiver
java.io.IOException: Block blk_8383681341419474724 is valid, and cannot be written to.
        at org.apache.nutch.ndfs.FSDataset.writeToBlock(FSDataset.java:264)
        at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:329)
        at java.lang.Thread.run(Thread.java:595)

1:30:51am - The crawldb update starts, and it quickly completes the map portion of the update:

060205 013011 CrawlDb update: starting
060205 013011 CrawlDb update: db: crawl-20060129091444/crawldb
060205 013011 CrawlDb update: segment: crawl-20060129091444/segments/20060205005908
060205 013011 parsing file:/home/crawler/nutch/conf/nutch-default.xml
060205 013011 parsing file:/home/crawler/nutch/conf/crawl-tool.xml
060205 013011 parsing file:/home/crawler/nutch/conf/mapred-default.xml
060205 013011 parsing file:/home/crawler/nutch/conf/mapred-default.xml
060205 013011 parsing file:/home/crawler/nutch/conf/nutch-site.xml
060205 013011 CrawlDb update: Merging segment data into db.
060205 013051 Running job: job_733lh6
060205 013052  map 0%

...snip...

060205 013228  map 100%

1:31:55am - One of the last map tasks for this job is assigned to tracker_69547 (on s5):

060205 013155 Adding task 'task_m_1yvznp' to set for tracker 'tracker_69547'

1:32:07am - Tracker_69547 (on s5) starts complaining about not being able to read blk_1084437997772174917:

060205 013207 task_m_1yvznp No node available for block blk_1084437997772174917
060205 013207 task_m_1yvznp Could not obtain block from any node: java.io.IOException: No live nodes contain current block
060205 013217 task_m_1yvznp No node available for block blk_1084437997772174917
060205 013217 task_m_1yvznp Could not obtain block from any node: java.io.IOException: No live nodes contain current block

...snip...

060205 014149 task_m_1yvznp No node available for block blk_1084437997772174917
060205 014149 task_m_1yvznp Could not obtain block from any node: java.io.IOException: No live nodes contain current block

1:41:55am - Tracker_69547 (on s5) gives up and kills its child process:

060205 014155 Task task_m_1yvznp timed out.  Killing.
060205 014156 Server connection on port 50050 from 127.0.0.1: exiting
060205 014156 task_m_1yvznp Child Error
java.io.IOException: Task process exit with nonzero status.
        at org.apache.nutch.mapred.TaskRunner.runChild(TaskRunner.java:139)
        at org.apache.nutch.mapred.TaskRunner.run(TaskRunner.java:92)
060205 014159 task_m_1yvznp done; removing files.

1:41:59am - The JobTracker notices that it has lost the map task, so it reassigns it to tracker_69547 (on s5) three more times, with exactly the same results in all of the logs. After the fourth failure, it aborts the owning job:

060205 014159 Task 'task_m_1yvznp' has been lost.
060205 014159 Adding task 'task_m_1yvznp' to set for tracker 'tracker_69547'
060205 015205 Task 'task_m_1yvznp' has been lost.
060205 015205 Adding task 'task_m_1yvznp' to set for tracker 'tracker_69547'
060205 020211 Task 'task_m_1yvznp' has been lost.
060205 020212 Adding task 'task_m_1yvznp' to set for tracker 'tracker_69547'
060205 021218 Task 'task_m_1yvznp' has been lost.
060205 021218 Task task_m_1yvznp has failed 4 times. Aborting owning job job_733lh6
060205 021218 Server connection on port 8010 from 127.0.0.1: exiting

2:25:31am - We see the first mention of the "missing" block in the NameNode log. This block was apparently hosted on s2:

060205 022531 Pending transfer (block blk_1084437997772174917) from s2:50010 to 1 destinations

2:25:31am - We see evidence of the "missing" block being successfully transferred in the DataNode log on s2. There is absolutely nothing in the log between a nondescript entry at 1:32:26am (which is after the time that tracker_69547 started complaining about the missing block) and the block transfer messages at 2:25:31am:

060205 013226 4479 Served block blk_4776786656991943447 to /192.168.1.8
060205 022531 11 Starting thread to transfer block blk_-1870083282504339828 to [Lorg.apache.nutch.ndfs.DatanodeInfo;@1561437
060205 022531 11 Starting thread to transfer block blk_1084437997772174917 to [Lorg.apache.nutch.ndfs.DatanodeInfo;@b6b2a5
060205 022531 4480 Transmitted block blk_-1870083282504339828 to s9/192.168.1.10:50010
060205 022531 4481 Transmitted block blk_1084437997772174917 to s9/192.168.1.10:50010


--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------