Gang,

At the risk of incurring cross-posting ire (and based on a suggestion from Stefan), I'm posting this to nutch-dev as well:

We're now running into "No node available for block <blockID>" errors, which are killing our MapReduce-based crawling jobs. I did some digging through our logs after one of these events to build a context for the problem, so I thought I'd share the results with the rest of you.

Our crawler died during the reduce phase of the crawldb update (the third time around the fetch loop). The ipc.client.timeout value was set to 1,800,000 ms (30 minutes). The job eventually dies because a map task is unable to obtain a block that it needs. However, this block apparently does exist on the DataNode, and for some reason it is successfully transferred 13 minutes after the job is aborted. A number of "Read timed out" and "EOF reading from Socket" errors do appear in the DataNode log before the problem first surfaces, but there's plenty of successful activity as well.
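For reference, the timeout is set in our Nutch config XML roughly like this (just a sketch, with the property shown in isolation; 1,800,000 ms = 30 minutes):

```xml
<!-- Sketch of the relevant property in our site config file
     (surrounding properties omitted). -->
<property>
  <name>ipc.client.timeout</name>
  <value>1800000</value>
  <description>Client-side IPC timeout in milliseconds (30 minutes).</description>
</property>
```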

Afterward, I was able to successfully update the crawldb using the command-line tool (after a stop-all.sh/start-all.sh sequence for good measure), and I restarted the crawler. It ran for a couple of generate/fetch/update loops before it finally died during URL partitioning, again due to "No node available for block <blockID>" errors. These referred to different blocks on a different DataNode slave, but those blocks, too, were apparently present on the DataNode, as evidenced by later log activity. Again, a stop-all.sh/start-all.sh sequence put the crawler back in business, and it's still running now.

Any light that could be shed on the underlying problem(s) would be greatly appreciated.

Thanks,

- Chris

11:45:24pm - The crawl begins:

bin/do_crawl.sh started at 20060204234524...

12:17:17am - A series of read timeout errors appears in the DataNode log on s2. No evidence of these problems appears in the NameNode log.

060205 001717 2457 DataXCeiver
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
        at java.io.FilterInputStream.read(FilterInputStream.java:66)
        at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:293)
        at java.lang.Thread.run(Thread.java:595)

...snip...

060205 001803 2523 DataXCeiver
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
        at java.io.FilterInputStream.read(FilterInputStream.java:66)
        at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:293)
        at java.lang.Thread.run(Thread.java:595)

12:19:56am - An EOF error appears in the DataNode log on s2. No evidence of this problem appears in the NameNode log:

060205 001956 2709 DataXCeiver
java.io.EOFException: EOF reading from Socket[addr=/192.168.1.9,port=42878,localport=50010]
        at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:383)
        at java.lang.Thread.run(Thread.java:595)

12:50:57am - Another series of read timeout errors appears in the DataNode log on s2.

12:53:40am - Another EOF error appears in the DataNode log on s2.

1:27:21am - Another series of read timeout errors appears in the DataNode log on s2. A few EOF errors are sprinkled in as well.

1:29:13am - We see the first mention of blk_1084437997772174917 in the DataNode log on s2. It appears to have received this block from 192.168.1.5 (s4):

060205 012913 4178 Received block blk_1084437997772174917 from /192.168.1.5

1:29:22am - A series of "Block <blockID> is valid, and cannot be written to" errors appears in the DataNode log on s2. None of these refer to blk_1084437997772174917, however:

060205 012922 4205 DataXCeiver
java.io.IOException: Block blk_-4636625784756138174 is valid, and cannot be written to.
        at org.apache.nutch.ndfs.FSDataset.writeToBlock(FSDataset.java:264)
        at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:329)
        at java.lang.Thread.run(Thread.java:595)

...snip...

060205 012923 4212 DataXCeiver
java.io.EOFException: EOF reading from Socket[addr=/192.168.1.4,port=44881,localport=50010]
        at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:383)
        at java.lang.Thread.run(Thread.java:595)
060205 012924 4213 Received block blk_-7873577862877309136 from /192.168.1.2
060205 012925 4214 DataXCeiver
java.io.EOFException: EOF reading from Socket[addr=/192.168.1.5,port=49342,localport=50010]
        at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:383)
        at java.lang.Thread.run(Thread.java:595)
060205 012926 4215 Received block blk_8833881170910103633 from /192.168.1.9 and mirrored to s5/192.168.1.6:50010
060205 012926 4216 Received block blk_-9065065135965140034 from /192.168.1.1
060205 012926 4217 DataXCeiver
java.io.IOException: Block blk_-803324019234272520 is valid, and cannot be written to.
        at org.apache.nutch.ndfs.FSDataset.writeToBlock(FSDataset.java:264)
        at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:329)
        at java.lang.Thread.run(Thread.java:595)
060205 012926 4218 Received block blk_1640893643911715360 from /192.168.1.8 and mirrored to s9/192.168.1.10:50010
060205 012926 4219 Received block blk_8581877778682012452 from /192.168.1.10
060205 012926 4220 Received block blk_5105973706640312551 from /192.168.1.5
060205 012926 4221 DataXCeiver
java.io.IOException: Block blk_8383681341419474724 is valid, and cannot be written to.
        at org.apache.nutch.ndfs.FSDataset.writeToBlock(FSDataset.java:264)
        at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:329)
        at java.lang.Thread.run(Thread.java:595)

1:30:51am - The crawldb update starts, and it quickly completes the map portion of the update:

060205 013011 CrawlDb update: starting
060205 013011 CrawlDb update: db: crawl-20060129091444/crawldb
060205 013011 CrawlDb update: segment: crawl-20060129091444/segments/20060205005908
060205 013011 parsing file:/home/crawler/nutch/conf/nutch-default.xml
060205 013011 parsing file:/home/crawler/nutch/conf/crawl-tool.xml
060205 013011 parsing file:/home/crawler/nutch/conf/mapred-default.xml
060205 013011 parsing file:/home/crawler/nutch/conf/mapred-default.xml
060205 013011 parsing file:/home/crawler/nutch/conf/nutch-site.xml
060205 013011 CrawlDb update: Merging segment data into db.
060205 013051 Running job: job_733lh6
060205 013052  map 0%

...snip...

060205 013228  map 100%

1:31:55am - One of the last map tasks for this job is assigned to tracker_69547 (on s5):

060205 013155 Adding task 'task_m_1yvznp' to set for tracker 'tracker_69547'

1:32:07am - Tracker_69547 (on s5) starts complaining about not being able to read blk_1084437997772174917:

060205 013207 task_m_1yvznp No node available for block blk_1084437997772174917
060205 013207 task_m_1yvznp Could not obtain block from any node: java.io.IOException: No live nodes contain current block
060205 013217 task_m_1yvznp No node available for block blk_1084437997772174917
060205 013217 task_m_1yvznp Could not obtain block from any node: java.io.IOException: No live nodes contain current block

...snip...

060205 014149 task_m_1yvznp No node available for block blk_1084437997772174917
060205 014149 task_m_1yvznp Could not obtain block from any node: java.io.IOException: No live nodes contain current block

1:41:55am - Tracker_69547 (on s5) gives up and kills its child process:

060205 014155 Task task_m_1yvznp timed out.  Killing.
060205 014156 Server connection on port 50050 from 127.0.0.1: exiting
060205 014156 task_m_1yvznp Child Error
java.io.IOException: Task process exit with nonzero status.
        at org.apache.nutch.mapred.TaskRunner.runChild(TaskRunner.java:139)
        at org.apache.nutch.mapred.TaskRunner.run(TaskRunner.java:92)
060205 014159 task_m_1yvznp done; removing files.

1:41:59am - The JobTracker notices that it has lost the map task, so it reassigns it to tracker_69547 (on s5) three more times, with exactly the same results in all of the logs. After the fourth failure, it aborts the owning job:

060205 014159 Task 'task_m_1yvznp' has been lost.
060205 014159 Adding task 'task_m_1yvznp' to set for tracker 'tracker_69547'
060205 015205 Task 'task_m_1yvznp' has been lost.
060205 015205 Adding task 'task_m_1yvznp' to set for tracker 'tracker_69547'
060205 020211 Task 'task_m_1yvznp' has been lost.
060205 020212 Adding task 'task_m_1yvznp' to set for tracker 'tracker_69547'
060205 021218 Task 'task_m_1yvznp' has been lost.
060205 021218 Task task_m_1yvznp has failed 4 times. Aborting owning job job_733lh6
060205 021218 Server connection on port 8010 from 127.0.0.1: exiting

2:25:31am - We see the first mention of the "missing" block in the NameNode log. This block was apparently hosted on s2:

060205 022531 Pending transfer (block blk_1084437997772174917) from s2:50010 to 1 destinations

2:25:31am - We see evidence of the "missing" block being successfully transferred in the DataNode log on s2. There is absolutely nothing in the log between a nondescript entry at 1:32:26am (which is after the time that tracker_69547 started complaining about the missing block) and the block transfer messages at 2:25:31am:

060205 013226 4479 Served block blk_4776786656991943447 to /192.168.1.8
060205 022531 11 Starting thread to transfer block blk_-1870083282504339828 to [Lorg.apache.nutch.ndfs.DatanodeInfo;@1561437
060205 022531 11 Starting thread to transfer block blk_1084437997772174917 to [Lorg.apache.nutch.ndfs.DatanodeInfo;@b6b2a5
060205 022531 4480 Transmitted block blk_-1870083282504339828 to s9/192.168.1.10:50010
060205 022531 4481 Transmitted block blk_1084437997772174917 to s9/192.168.1.10:50010


--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------