Hi,
I've been using Nutch 0.7 for a few months, and recently started
working with 0.8. I'm currently testing everything on a single server,
using the local file system. I generated 10 segments with 100k URLs
each, and fetched the content. Then I ran updatedb, but it looks like
the crawldb isn't working properly. For example, after running updatedb
on one segment, -stats shows this:
060409 140035 status 1 (DB_unfetched): 1732457
060409 140035 status 2 (DB_fetched): 82608
060409 140035 status 3 (DB_gone): 3447
I then ran the updatedb against the next segment, and -stats now shows this:
060409 150737 status 1 (DB_unfetched): 1777642
060409 150737 status 2 (DB_fetched): 81629
060409 150737 status 3 (DB_gone): 3377
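For reference, here is roughly the per-segment sequence I'm running (the paths and -topN value are placeholders from my local setup, so treat the exact arguments as assumptions rather than a recipe):

```shell
# Rough sketch of my per-segment cycle (0.8-style commands; paths are
# placeholders from my local single-server setup).
bin/nutch generate crawl/crawldb crawl/segments -topN 100000
s=`ls -d crawl/segments/* | tail -1`   # newest segment
bin/nutch fetch $s
bin/nutch updatedb crawl/crawldb $s
bin/nutch readdb crawl/crawldb -stats  # the numbers quoted above
```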
Any idea why the number of fetched URLs would actually go down? What I
*think* is happening is that the crawldb only contains the data from the
most recent updatedb, not the accumulated results of all of them. Does
this make sense? I repeated the test, running updatedb and -stats for
each segment in turn, and the counts are always around 80k fetched and
1.7M unfetched; the numbers don't seem to be accumulating.
Since readseg is broken in 0.8, I can't really see what is actually
in the segments. Is there an alternative way to find out how many URLs
are in a segment, and how many of them were fetched?
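The closest cross-check I can think of is inspecting the crawldb itself rather than the segment. Assuming readdb's -dump option works in 0.8 (I haven't verified this, and the exact status strings in the dump are my guess), something like this would let me count entries by status independently of -stats:

```shell
# Untested workaround sketch: dump the crawldb as plain text and count
# entries by status string. Output path and the exact "DB_fetched"
# marker in the dump format are assumptions on my part.
bin/nutch readdb crawl/crawldb -dump crawldb_dump
grep -c "DB_fetched" crawldb_dump/part-00000
```

It still wouldn't tell me what's inside a given segment, though, so a working segment reader would be much better.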
If you have any ideas, please let me know. Thanks a lot!
Jason