Doh, I think I found the problem. After using Luke to dig through the indexed segments, it looks like all of the segments I generated contain exactly the same URLs. When you generate a segment with the top 100k URLs, I'm guessing they are not marked in any way to prevent the next generate from grabbing the same URLs? I'd like to generate multiple segments in a row and send them off to another server. Is this possible using the local file system?
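
The suspected behavior can be sketched with a toy model. This is not Nutch code: the function names, the `generated` flag, and the scoring are all hypothetical, and it only illustrates why repeated top-N selection without marking would return the same URLs every time.

```python
# Toy model only, NOT Nutch code. All names here are made up for illustration.

def make_db(n):
    """Build a fake crawldb of n unfetched URLs with distinct scores."""
    return {
        f"http://example.com/{i}": {"status": "unfetched", "score": n - i}
        for i in range(n)
    }

def generate(crawldb, top_n, mark=False):
    """Select the top_n highest-scoring unfetched URLs.

    With mark=True, selected URLs get a hypothetical 'generated' flag so
    that a later call skips them.
    """
    candidates = [
        url for url, rec in crawldb.items()
        if rec["status"] == "unfetched" and not rec.get("generated")
    ]
    candidates.sort(key=lambda url: crawldb[url]["score"], reverse=True)
    segment = candidates[:top_n]
    if mark:
        for url in segment:
            crawldb[url]["generated"] = True
    return segment

# Three generates in a row without marking: every segment is identical.
db = make_db(10)
unmarked = [generate(db, 3) for _ in range(3)]

# Three generates in a row with marking: every segment is different.
db2 = make_db(10)
marked = [generate(db2, 3, mark=True) for _ in range(3)]

distinct_unmarked = len({url for seg in unmarked for url in seg})
distinct_marked = len({url for seg in marked for url in seg})
print(distinct_unmarked, distinct_marked)
```

In this model, the only way to get distinct segments without marking is to update the db between generates, which matches what the segments looked like in Luke.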

Jason


Jason Camp wrote:

Hi,
I've been using Nutch 0.7 for a few months, and recently started working with 0.8. I'm testing everything right now on a single server, using the local file system. I generated 10 segments with 100k URLs in each, and fetched the content. Then I ran updatedb, but it looks like the crawldb isn't working properly. For example, I ran the updatedb command on one segment, and -stats shows this:

060409 140035 status 1 (DB_unfetched):  1732457
060409 140035 status 2 (DB_fetched):    82608
060409 140035 status 3 (DB_gone):       3447

I then ran the updatedb against the next segment, and -stats now shows this:

060409 150737 status 1 (DB_unfetched):  1777642
060409 150737 status 2 (DB_fetched):    81629
060409 150737 status 3 (DB_gone):       3377


Any idea why the number of fetched URLs would actually go down? What I *think* is happening is that the crawldb only contains the data from the most recently updated segment, rather than accumulating across segments. Does this make sense? I ran the test by updating with each segment in turn and running -stats after each one, and every run shows around 80k fetched and 1.7M unfetched, so the numbers don't seem to be accumulating.
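
To put numbers on the change, here is a small sketch that parses the two -stats excerpts quoted above and prints the per-status difference. The regular expression assumes the log lines look exactly like the ones in this message; it is not based on any documented Nutch output format.

```python
import re

# The two -stats excerpts, copied from this message.
first = """060409 140035 status 1 (DB_unfetched):  1732457
060409 140035 status 2 (DB_fetched):    82608
060409 140035 status 3 (DB_gone):       3447"""

second = """060409 150737 status 1 (DB_unfetched):  1777642
060409 150737 status 2 (DB_fetched):    81629
060409 150737 status 3 (DB_gone):       3377"""

def parse_stats(text):
    """Map each status name (e.g. 'DB_fetched') to its count."""
    counts = {}
    for line in text.splitlines():
        match = re.search(r"\((DB_\w+)\):\s+(\d+)", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts

before = parse_stats(first)
after = parse_stats(second)

# Per-status change between the two updatedb runs.
delta = {status: after[status] - before[status] for status in before}
total_delta = sum(after.values()) - sum(before.values())

for status, change in delta.items():
    print(f"{status}: {change:+d}")
print(f"total: {total_delta:+d}")
```

The fetched count drops by 979 and the gone count by 70, while unfetched grows by 45,185. If the second segment had held 100k new URLs, fetched should have risen by tens of thousands instead.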

Since readsegs is broken in 0.8, I can't really get an idea of what is actually in the segments. Is there an alternative way to see how many URLs are in a segment and how many of them were fetched?

If you have any ideas, please let me know. Thanks a lot!

Jason




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
