I'm using the source from trunk as of Sep 5th.

I'm trying to run a crawl of 152 thousand pages and the pages they link to directly. I injected the URLs into a fresh database, generated the fetch list, and fetched/parsed the URLs just fine. I then updated the database from the segment.

I've had some problems trying to fetch all of the next depth at once, so I generated a fetch list with topN 1000000 and fetched/parsed that. My intention is to update the crawl database with the results using the -noAdditions flag, then generate another topN 1000000 fetch list to get a new chunk of unfetched pages, and repeat until I've fetched everything in the crawl database. I've done this process before and it worked fine.
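Each round of that cycle looks roughly like this (a sketch of my process; the segment name below is just a placeholder for whichever directory the generate step actually creates):

```shell
# One round of the chunked crawl cycle.
bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
SEGMENT=crawl/segments/20070912211441   # placeholder: the newly generated segment
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT                # separate parse step, since the fetcher didn't parse
bin/nutch updatedb crawl/crawldb $SEGMENT -noAdditions
```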
However, after fetching the first chunk I noticed a problem.
bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 2968546
retry 0: 2945751
retry 1: 22795
min score: 0.0090
avg score: 0.088
max score: 448.086
status 1 (db_unfetched): 2968546
CrawlDb statistics: done
The crawl database has about 3 million URLs in it, which is about what I expected, but it doesn't show any of them as being fetched. When I ran updatedb, it obviously added the newly found URLs to the database, but why didn't it mark the initial 152k as fetched?
I then checked the segments to see if there was a problem there. I expected one segment of about 150k URLs and another of 1 million.
bin/nutch readseg -list -dir crawl/segments
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20070912153051  152055     2007-09-12T15:51:27  2007-09-12T18:46:54  163454   111611
20070912211441  1000000    2007-09-12T22:07:37  2007-09-13T15:44:04  1055928  765981
Well, that looks OK. So I decided to update the database from these segments again, using the -noAdditions argument.
bin/nutch updatedb crawl/crawldb -dir crawl/segments -noAdditions
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [/user/nutch/crawl/segments/20070912153051,
/user/nutch/crawl/segments/20070912211441]
CrawlDb update: additions allowed: false
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done
That seemed to work fine; it picked up both segments. I ran another -stats on the crawl database to see the effect.
bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 2968546
retry 0: 2825340
retry 1: 129033
retry 2: 14173
min score: 0.019
avg score: 0.175
max score: 1466.019
status 1 (db_unfetched): 2854006
status 2 (db_fetched): 96872
status 3 (db_gone): 4854
status 4 (db_redir_temp): 6527
status 5 (db_redir_perm): 6286
status 6 (db_notmodified): 1
CrawlDb statistics: done
According to these stats, the database only has information on 96,872 fetched URLs, even though I just updated it from segments with 111,611 and 765,981 successfully fetched and parsed URLs. Something doesn't seem to be working right. The segments were generated from this crawl database, so almost every URL should already be present (this is a standard web crawl with pages that look like a fairly random sample of the web). I decided to run the update yet again, but with only one of the segments this time.
bin/nutch updatedb crawl/crawldb crawl/segments/20070912153051 -noAdditions
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20070912153051]
CrawlDb update: additions allowed: false
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done
OK, no errors. Now to check the stats on the crawl database.
bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 2968546
retry 0: 2839104
retry 1: 129442
min score: 0.029
avg score: 0.213
max score: 1913.106
status 1 (db_unfetched): 2968546
CrawlDb statistics: done
Now I'm back at square one, with the database not showing any fetched URLs at all. What's going on?
Thanks,
Tim