Hi,
I've been using Nutch 0.7 for a few months, and recently started
working with 0.8. I'm currently testing everything on a single server,
using the local file system. I generated 10 segments with 100k URLs
each, and fetched the content. Then I ran updatedb, but it looks like
the crawldb isn't working properly. For example, after running updatedb
on one segment, -stats shows this:
060409 140035 status 1 (DB_unfetched): 1732457
060409 140035 status 2 (DB_fetched): 82608
060409 140035 status 3 (DB_gone): 3447
I then ran the updatedb against the next segment, and -stats now shows this:
060409 150737 status 1 (DB_unfetched): 1777642
060409 150737 status 2 (DB_fetched): 81629
060409 150737 status 3 (DB_gone): 3377
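For reference, here is roughly the per-segment sequence I'm running (the paths and -topN value are placeholders from my local setup, so treat the exact arguments as assumptions rather than a recipe):

```shell
# Rough sketch of my per-segment cycle (0.8-style commands; paths are
# placeholders from my local single-server setup).
bin/nutch generate crawl/crawldb crawl/segments -topN 100000
s=`ls -d crawl/segments/* | tail -1`   # newest segment
bin/nutch fetch $s
bin/nutch updatedb crawl/crawldb $s
bin/nutch readdb crawl/crawldb -stats  # the numbers quoted above
```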
Any idea why the number of fetched URLs would actually go down? What I
*think* is happening is that the crawldb only contains the data from the
most recent updatedb, not the accumulated results of all of them. Does
this make sense? I repeated the test, running updatedb and -stats for
each segment in turn, and the counts are always around 80k fetched and
1.7M unfetched; the numbers don't seem to be accumulating.
Since readseg is broken in 0.8, I can't really see what is actually
in the segments. Is there an alternative way to find out how many URLs
are in a segment, and how many of them were fetched?
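The closest cross-check I can think of is inspecting the crawldb itself rather than the segment. Assuming readdb's -dump option works in 0.8 (I haven't verified this, and the exact status strings in the dump are my guess), something like this would let me count entries by status independently of -stats:

```shell
# Untested workaround sketch: dump the crawldb as plain text and count
# entries by status string. Output path and the exact "DB_fetched"
# marker in the dump format are assumptions on my part.
bin/nutch readdb crawl/crawldb -dump crawldb_dump
grep -c "DB_fetched" crawldb_dump/part-00000
```

It still wouldn't tell me what's inside a given segment, though, so a working segment reader would be much better.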
If you have any ideas, please let me know. Thanks a lot!
Jason