I've done some additional testing on this problem. I deleted the crawl database and recreated it by injecting my initial URL list.
bin/nutch inject crawls/temp/crawl/crawldb crawls/temp/crawl/urls/
Injector: starting
Injector: crawlDb: crawls/temp/crawl/crawldb
Injector: urlDir: crawls/temp/crawl/urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done

It seemed to work fine, so I ran a -stats on the database.

bin/nutch readdb crawls/temp/crawl/crawldb/ -stats
CrawlDb statistics start: crawls/temp/crawl/crawldb/
Statistics for CrawlDb: crawls/temp/crawl/crawldb/
TOTAL urls:     152298
retry 0:        152298
min score:      1.0
avg score:      1.0
max score:      1.0
status 1 (db_unfetched):        152298
CrawlDb statistics: done

So far so good. Now I run the update with the segment I fetched/parsed.

bin/nutch updatedb crawls/temp/crawl/crawldb/ crawls/temp/crawl/segments/20070912153051/
CrawlDb update: starting
CrawlDb update: db: crawls/temp/crawl/crawldb
CrawlDb update: segments: [crawls/temp/crawl/segments/20070912153051]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done

Again, no errors. Everything seemed to work fine. Now I run a -stats on the crawl database again to see the results.

bin/nutch readdb crawls/temp/crawl/crawldb/ -stats
CrawlDb statistics start: crawls/temp/crawl/crawldb/
Statistics for CrawlDb: crawls/temp/crawl/crawldb/
TOTAL urls:     2968789
retry 0:        2945994
retry 1:        22795
min score:      0.0090
avg score:      0.088
max score:      448.086
status 1 (db_unfetched):        2968789
CrawlDb statistics: done

Same problem. It added the new URLs that were discovered during parsing, but not a single URL has been marked as fetched in the database.

I thought there was a slight possibility that the URL list I used for this test injection wasn't the same one I used a couple of days ago. If that were the case, it might explain why new URLs were added but none of the existing ones were updated. So I dumped the segment to a file and searched it for some of the URLs from the list I just injected. Sure enough, they are there. Not only are they in the segment, but the segment contains CrawlDatum entries for them with a status of 33 (fetch_success). However, looking them up in the database with readdb -url shows them with a status of unfetched. (The commands I used for this check are sketched at the end of this message.)

It seems to me that there may be a bug in either updatedb or readdb -stats. Can anyone help me out? I'm really hoping that I'm just doing something wrong, but I can't figure out what it might be.

Thanks,
Tim

---------- Forwarded message ----------
From: Tim Gautier <[EMAIL PROTECTED]>
Date: Sep 14, 2007 11:06 AM
Subject: Problems with the crawl database
To: [email protected]

Using the source in the trunk as of Sep 5th.

I'm trying to run a crawl of 152 thousand pages and the pages they link to directly. I injected the URLs into a fresh database, generated the fetch list, and fetched/parsed the URLs just fine. I then updated the database from the segment. I've had some problems trying to fetch all of the next depth at once, so I generated a fetch list with -topN 1000000 and fetched/parsed that. My intention is to update the crawl database with the results using the -noAdditions flag, generate another topN 1000000 fetch list to get a new chunk of unfetched pages, and repeat this process until I've fetched everything in the crawl database (one pass of that loop is sketched just below). I've done this process before and it's worked fine. However, after fetching the first chunk I noticed a problem.
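For reference, one pass of that loop looks roughly like this. The segment name is just the one generate created during this crawl, and depending on whether the fetcher is configured to parse while fetching, the separate parse step may not be needed.

# pick the next chunk of unfetched URLs from the crawldb
bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
# fetch and parse the new segment (the name comes from generate's output)
bin/nutch fetch crawl/segments/20070912211441
bin/nutch parse crawl/segments/20070912211441
# fold the results back into the crawldb without adding newly discovered URLs
bin/nutch updatedb crawl/crawldb crawl/segments/20070912211441 -noAdditions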
bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:     2968546
retry 0:        2945751
retry 1:        22795
min score:      0.0090
avg score:      0.088
max score:      448.086
status 1 (db_unfetched):        2968546

The crawl database has about 3 million URLs in it, which is about what I expected, but it doesn't show any of them as fetched. When I ran the updatedb, it obviously put the newly found URLs in the database, but why didn't it mark the initial 152k as fetched?

I then checked the segments to see if there was a problem there. I expected one segment of about 150k URLs, and another of 1 million URLs.

bin/nutch readseg -list -dir crawl/segments
NAME            GENERATED   FETCHER START          FETCHER END            FETCHED   PARSED
20070912153051  152055      2007-09-12T15:51:27    2007-09-12T18:46:54    163454    111611
20070912211441  1000000     2007-09-12T22:07:37    2007-09-13T15:44:04    1055928   765981

Well, that looks OK. So I decided to update the database with these segments again using the -noAdditions argument.

bin/nutch updatedb crawl/crawldb -dir crawl/segments -noAdditions
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [/user/nutch/crawl/segments/20070912153051, /user/nutch/crawl/segments/20070912211441]
CrawlDb update: additions allowed: false
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done

That seemed to work fine. It picked up both segments. I did another -stats on the crawl database to see the effect.

bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:     2968546
retry 0:        2825340
retry 1:        129033
retry 2:        14173
min score:      0.019
avg score:      0.175
max score:      1466.019
status 1 (db_unfetched):        2854006
status 2 (db_fetched):  96872
status 3 (db_gone):     4854
status 4 (db_redir_temp):       6527
status 5 (db_redir_perm):       6286
status 6 (db_notmodified):      1
CrawlDb statistics: done

According to these stats, the database only has information on 96,872 fetched URLs, even though I just updated it from segments with 111,611 and 765,981 successfully fetched and parsed URLs. Something doesn't seem to be working right. The segments were generated from this crawl database, so almost every URL should be present (this is a standard web crawl with pages that look like a fairly random set of the web).

I decided to do an update yet again, but only using one of the segments this time.

bin/nutch updatedb crawl/crawldb crawl/segments/20070912153051 -noAdditions
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20070912153051]
CrawlDb update: additions allowed: false
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done

OK, received no errors. Now to check the stats on the crawl database.

bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:     2968546
retry 0:        2839104
retry 1:        129442
min score:      0.029
avg score:      0.213
max score:      1913.106
status 1 (db_unfetched):        2968546
CrawlDb statistics: done

Now I'm back at square one, with the database not showing any fetched URLs. What's going on?

Thanks,
Tim
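P.S. In case it helps anyone reproduce the check I described in my reply above: the segment dump and per-URL lookup were roughly the following, where the URL is just a placeholder for one of the injected URLs and "segdump" is an output directory I picked.

# dump the fetched/parsed segment to text
bin/nutch readseg -dump crawls/temp/crawl/segments/20070912153051/ segdump
# find the CrawlDatum for an injected URL; it shows status 33 (fetch_success)
grep -r -A 3 'http://one-of-the-injected-urls/' segdump
# look the same URL up in the crawldb; it still reports db_unfetched
bin/nutch readdb crawls/temp/crawl/crawldb/ -url 'http://one-of-the-injected-urls/'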
