Doh, I think I found the problem. After using Luke to dig through
the indexed segments, it looks like all of the segments that I generated
contain exactly the same urls. When you generate a segment with the top
100k urls, I'm guessing they are not marked in any way to prevent the
next generate from grabbing the same urls? I'd like to generate multiple
segments in a row and send them off to another server. Is this possible
using the local file system?
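
To illustrate the guess above, here is a minimal Python sketch of a toy crawldb. It is not Nutch code: the `generate` and `update` functions, the record layout, and the example.com urls are all hypothetical, and the model simply assumes that selection depends only on the current crawldb state. Under that assumption, two generates in a row with no update in between return the same urls.

```python
# Toy model only: these names and this record layout are hypothetical,
# not the Nutch API. It assumes selection depends only on crawldb state.

def generate(crawldb, top_n):
    """Return the top_n highest-scoring urls still marked unfetched."""
    unfetched = [u for u, rec in crawldb.items() if rec["status"] == "unfetched"]
    # Sort by descending score, then by url so ties are deterministic.
    unfetched.sort(key=lambda u: (-crawldb[u]["score"], u))
    return unfetched[:top_n]

def update(crawldb, segment):
    """Mark every url in the segment as fetched in the toy crawldb."""
    for url in segment:
        crawldb[url]["status"] = "fetched"

crawldb = {
    f"http://example.com/{i}": {"status": "unfetched", "score": 1.0 / (i + 1)}
    for i in range(10)
}

# Two generates in a row, nothing recorded in between: identical segments.
seg_a = generate(crawldb, 3)
seg_b = generate(crawldb, 3)
same_without_update = seg_a == seg_b

# Once the first segment is recorded, the next generate picks new urls.
update(crawldb, seg_a)
seg_c = generate(crawldb, 3)
overlap_after_update = set(seg_a) & set(seg_c)

print("identical without update:", same_without_update)
print("overlap after update:", len(overlap_after_update))
```

In this toy model the segments only diverge once the crawldb records what was already handed out, which is the behaviour the question is asking about.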
Jason
Jason Camp wrote:
Hi,
I've been using Nutch 0.7 for a few months, and recently started
working with 0.8. I'm testing everything right now on a single server,
using the local file system. I generated 10 segments with 100k urls
in each, and fetched the content. Then I ran updatedb, but it looks
like the crawldb isn't working properly. For example, I ran the
updatedb command on one segment, and -stats shows this:
060409 140035 status 1 (DB_unfetched): 1732457
060409 140035 status 2 (DB_fetched): 82608
060409 140035 status 3 (DB_gone): 3447
I then ran the updatedb against the next segment, and -stats now shows
this:
060409 150737 status 1 (DB_unfetched): 1777642
060409 150737 status 2 (DB_fetched): 81629
060409 150737 status 3 (DB_gone): 3377
Any idea why the number of fetched urls would actually go down? What I
*think* is happening is that the crawldb only contains the data from
the most recent updatedb, not the earlier ones. Does this make sense? I
repeated the test, running updatedb on each segment followed by -stats,
and the results are all around 80k fetched and 1.7m unfetched, but the
numbers don't seem to be accumulating.
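
For reference, the change between the two -stats outputs above can be worked out with a short Python sketch. The regular expression and the `parse_stats` helper are assumptions made for this example, based only on the three log lines quoted in this message.

```python
import re

# Matches lines such as "060409 140035 status 2 (DB_fetched): 82608".
# The pattern is an assumption based on the lines quoted in this message.
STAT_LINE = re.compile(r"status \d+ \((\w+)\): (\d+)")

def parse_stats(text):
    """Map each status name to its count."""
    return {name: int(count) for name, count in STAT_LINE.findall(text)}

first_run = """\
060409 140035 status 1 (DB_unfetched): 1732457
060409 140035 status 2 (DB_fetched): 82608
060409 140035 status 3 (DB_gone): 3447
"""

second_run = """\
060409 150737 status 1 (DB_unfetched): 1777642
060409 150737 status 2 (DB_fetched): 81629
060409 150737 status 3 (DB_gone): 3377
"""

before = parse_stats(first_run)
after = parse_stats(second_run)
deltas = {name: after[name] - before[name] for name in before}

for name, delta in deltas.items():
    print(f"{name}: {before[name]} -> {after[name]} ({delta:+d})")
print("total urls:", sum(before.values()), "->", sum(after.values()))
```

The fetched count drops by 979 and the gone count by 70, while the unfetched count grows by 45185, so the total goes from 1818512 to 1862648 urls.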
Since readsegs is broken in 0.8, I can't really get an idea of what is
actually in the segments. Is there an alternative way to see how many
urls are actually in a segment, and how many of them were fetched?
If you have any ideas, please let me know. Thanks a lot!
Jason
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general