Hello,

I spent some time analyzing this and I am a bit surprised by the results. I had always assumed that refetch works as described in one of the emails on the list: "-refetchonly generates a segment (FetchList) that contains only the URLs that need to be refetched based on your refetch interval. Right, newly discovered links are not in the fetchlist that will be generated by using this option."

But today, after reading the code and performing some experiments, I know that is not true: it works the way you described it. I will post an email later today with my findings to clarify this issue. So -refetchonly is not currently working as expected, since it also generates new URLs (even though they are not fetched later).
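For reference, the command sequence under discussion looks like the following. This is only a sketch based on the commands quoted below; the segment directory name is illustrative, and it assumes a Nutch layout with a `db` web database and a `segments` directory:

```sh
# Generate a fetchlist of URLs whose refetch interval has expired.
# As observed above, this currently also emits newly discovered links,
# even though they are not fetched later:
bin/nutch generate db segments -refetchonly

# Optionally cap the size of the generated fetchlist:
bin/nutch generate db segments -refetchonly -topN 3000

# Fetch the generated segment (directory name is illustrative):
bin/nutch fetch segments/20050602120000
```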
Regards,
Piotr


carmmello wrote:
Hi,
I have about 300 sites on a specific subject to start with, and I have used both the crawl method and the whole-web method. Once, for testing purposes, I crawled those sites to depth 2 with a refetch interval of just 1 day (I set this in the site.xml file) and got about 3,000 pages. After that 1 day I used the command "bin/nutch generate db segments" with the single flag "-refetchonly". When I did a fetch of the generated segment, I got about 30,000 pages. If, besides -refetchonly, I used -topN 3000, for instance, I would get different pages, not the original ones. So I really don't know how, beginning with an initial set of fetched or crawled sites, to perform maintenance on them, adding only modified or new pages to the ones I already have.
Thanks

