Hi,
I've been using Nutch 0.7 for a few months, and recently started working with 0.8. I'm testing everything right now on a single server, using the local file system. I generated 10 segments with 100k URLs in each and fetched the content. Then I run updatedb, but it looks like the crawldb isn't being updated properly. For example, I ran the updatedb command on one segment, and -stats shows this:

060409 140035 status 1 (DB_unfetched):  1732457
060409 140035 status 2 (DB_fetched):    82608
060409 140035 status 3 (DB_gone):       3447

I then ran updatedb against the next segment, and -stats now shows this:

060409 150737 status 1 (DB_unfetched):  1777642
060409 150737 status 2 (DB_fetched):    81629
060409 150737 status 3 (DB_gone):       3377

Any idea why the number of fetched URLs would actually go down? What I *think* is happening is that the crawldb only contains the data from the most recent updatedb, not the accumulated results of the earlier ones. Does this make sense? I ran the test against each segment, checking -stats after every updatedb, and the counts are always around 80k fetched and 1.7M unfetched; the numbers don't seem to be accumulating.
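
For reference, each pass looks roughly like this (the directory layout is just my test setup, and the segment timestamp is illustrative):

  bin/nutch generate crawl/crawldb crawl/segments -topN 100000
  bin/nutch fetch crawl/segments/20060409140035
  bin/nutch updatedb crawl/crawldb crawl/segments/20060409140035
  bin/nutch readdb crawl/crawldb -stats

My expectation was that each updatedb merges that segment's fetch results into the existing crawldb rather than replacing it.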

Since readseg is broken in 0.8, I can't really get an idea of what is actually in the segments. Is there an alternative way to see how many URLs are in a segment and how many of them were actually fetched?
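
One thing I was considering is reading a segment's crawl_fetch output directly with SequenceFile.Reader and tallying the entries myself, something like the untested sketch below. The part-00000 path assumes a single reduce task, and the exact constructor signatures may differ depending on which Hadoop revision your 0.8 checkout bundles:

  import java.util.TreeMap;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Writable;
  import org.apache.nutch.crawl.CrawlDatum;

  public class CountFetched {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.getLocal(conf);
      // crawl_fetch is a MapFile directory; the key/value records
      // live in the "data" file inside each part.
      Path data = new Path(args[0], "crawl_fetch/part-00000/data");
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
      Writable key = (Writable) reader.getKeyClass().newInstance();
      CrawlDatum value = new CrawlDatum();
      TreeMap counts = new TreeMap();   // status byte -> count
      long total = 0;
      while (reader.next(key, value)) {
        total++;
        Byte status = new Byte(value.getStatus());
        Long n = (Long) counts.get(status);
        counts.put(status, new Long(n == null ? 1 : n.longValue() + 1));
      }
      reader.close();
      System.out.println("total entries in crawl_fetch: " + total);
      System.out.println("entries by CrawlDatum status byte: " + counts);
    }
  }

If something like that is reasonable, the per-status totals should show how many pages in a segment were actually fetched successfully.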

If you have any ideas, please let me know. Thanks a lot!

Jason
