I've done some additional testing on this problem.  I deleted the
crawl database and recreated it by injecting my initial url list.

bin/nutch inject crawls/temp/crawl/crawldb crawls/temp/crawl/urls/
Injector: starting
Injector: crawlDb: crawls/temp/crawl/crawldb
Injector: urlDir: crawls/temp/crawl/urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done

It seemed to work fine, so I ran a -stats on the database.

bin/nutch readdb crawls/temp/crawl/crawldb/ -stats
CrawlDb statistics start: crawls/temp/crawl/crawldb/
Statistics for CrawlDb: crawls/temp/crawl/crawldb/
TOTAL urls:     152298
retry 0:        152298
min score:      1.0
avg score:      1.0
max score:      1.0
status 1 (db_unfetched):        152298
CrawlDb statistics: done

So far so good.  Now I run the update with the segment I fetched/parsed.

bin/nutch updatedb crawls/temp/crawl/crawldb/
crawls/temp/crawl/segments/20070912153051/
CrawlDb update: starting
CrawlDb update: db: crawls/temp/crawl/crawldb
CrawlDb update: segments: [crawls/temp/crawl/segments/20070912153051]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done

Again, no errors.  Everything seemed to work fine.  Now I run a -stats
on the crawl database again to see the results.

bin/nutch readdb crawls/temp/crawl/crawldb/ -stats
CrawlDb statistics start: crawls/temp/crawl/crawldb/
Statistics for CrawlDb: crawls/temp/crawl/crawldb/
TOTAL urls:     2968789
retry 0:        2945994
retry 1:        22795
min score:      0.0090
avg score:      0.088
max score:      448.086
status 1 (db_unfetched):        2968789
CrawlDb statistics: done

Same problem.  It added the new URLs that were discovered during
parsing, but not a single URL has been marked as fetched in the
database.

I thought there was a slight possibility that the URL list I used to
do the injection for my test wasn't the same one I used a couple of
days ago.  If that were the case, it might explain why new URLs were
added but none of the existing ones were updated.  So I dumped the
segment to a file and searched it for some of the URLs that were in
the URL list I just used.  Sure enough, they are there.  Not only are
they in the segment, but the segment contains CrawlDatum entries for
them with a status of 33 (fetch_success).  However, looking them up
in the database with readdb -url shows them with a status of
db_unfetched.
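
In case it helps, the check went roughly like this (the dump
directory and the URL below are placeholders, not the exact ones I
used):

bin/nutch readseg -dump crawls/temp/crawl/segments/20070912153051 segdump
bin/nutch readdb crawls/temp/crawl/crawldb/ -url http://www.example.com/

I grepped the dump that readseg wrote for URLs from my inject list;
they appear there with status 33 (fetch_success), while readdb -url
reports the same URLs as db_unfetched.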

It seems to me that there may be a bug in either updatedb or readdb
-stats.  Can anyone help me out?  I'm really hoping that I'm just
doing something wrong, but I can't figure out what it might be.

Thanks,
Tim


---------- Forwarded message ----------
From: Tim Gautier <[EMAIL PROTECTED]>
Date: Sep 14, 2007 11:06 AM
Subject: Problems with the crawl database
To: [email protected]


Using the source in the trunk as of Sep 5th.

I'm trying to run a crawl of 152 thousand pages and the pages they
link to directly.  I injected the urls into a fresh database,
generated the fetch list, and fetched/parsed the urls just fine.  I
then updated the database from the segment.
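
For the first round, the sequence was essentially the standard one
(the paths and segment name below are approximate):

bin/nutch inject crawl/crawldb crawl/urls
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/20070912153051
bin/nutch parse crawl/segments/20070912153051
bin/nutch updatedb crawl/crawldb crawl/segments/20070912153051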

I've had some problems trying to fetch all of the next depth at once,
so I generated a fetch list with -topN 1000000 and fetched/parsed
that.  My intention is to update the crawl database with the results
using the -noAdditions flag, then generate another fetch list with
-topN 1000000 to get a new chunk of unfetched pages, and repeat this
process until I've fetched everything in the crawl database.  I've
done this process before and it's worked fine.
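
In other words, each subsequent pass is meant to look roughly like
this, with <segment> standing in for whichever segment generate just
created:

bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
bin/nutch fetch crawl/segments/<segment>
bin/nutch parse crawl/segments/<segment>
bin/nutch updatedb crawl/crawldb crawl/segments/<segment> -noAdditions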

However, after fetching the first chunk I noticed a problem.

bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:     2968546
retry 0:        2945751
retry 1:        22795
min score:      0.0090
avg score:      0.088
max score:      448.086
status 1 (db_unfetched):        2968546

The crawl database has about 3 million URLs in it, which is about
what I expected, but it doesn't show any of them as being fetched.
When I ran the updatedb, it obviously put the newly found URLs into
the database, but why didn't it mark the initial 152k as fetched?

I then checked the segments to see if there was a problem there.  I
expected one segment of about 150k urls, and another of 1 million
urls.

bin/nutch readseg -list -dir crawl/segments
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20070912153051  152055     2007-09-12T15:51:27  2007-09-12T18:46:54  163454   111611
20070912211441  1000000    2007-09-12T22:07:37  2007-09-13T15:44:04  1055928  765981

Well, that looks ok.  So I decided to update the database with these
segments again using the -noAdditions argument.

bin/nutch updatedb crawl/crawldb -dir crawl/segments -noAdditions

CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [/user/nutch/crawl/segments/20070912153051,
/user/nutch/crawl/segments/20070912211441]
CrawlDb update: additions allowed: false
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done


That seemed to work fine.  It picked up both segments.  I did
another -stats on the crawl database to see the effect.

bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:     2968546
retry 0:        2825340
retry 1:        129033
retry 2:        14173
min score:      0.019
avg score:      0.175
max score:      1466.019
status 1 (db_unfetched):        2854006
status 2 (db_fetched):  96872
status 3 (db_gone):     4854
status 4 (db_redir_temp):       6527
status 5 (db_redir_perm):       6286
status 6 (db_notmodified):      1
CrawlDb statistics: done

According to these stats, the database only has information on 96,872
fetched URLs, even though I just updated it from segments with 111,611
and 765,981 successfully fetched and parsed URLs.  Something doesn't
seem to be working right.  The segments were generated from this crawl
database, so almost every URL should already be present (this is a
standard web crawl, with pages that look like a fairly random sample
of the web).

I decided to do an update yet again, but only using one of the
segments this time.

bin/nutch updatedb crawl/crawldb crawl/segments/20070912153051 -noAdditions

CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20070912153051]
CrawlDb update: additions allowed: false
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done


Ok, I received no errors.  Now to check the stats on the crawl database.

bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:     2968546
retry 0:        2839104
retry 1:        129442
min score:      0.029
avg score:      0.213
max score:      1913.106
status 1 (db_unfetched):        2968546
CrawlDb statistics: done

Now I'm back at square one with the database not showing any fetched
urls.  What's going on?

Thanks,
Tim
