Hello,
I spent some time analyzing this and I am a bit surprised by the results. I
always assumed that -refetchonly works as described in one of the emails on
this list:
"-refetchonly generates you an segment(FetchList) that only contains the
urls
that need to be refetched based on your refetch interval.
Right, new discovered links are not in the fetchlist that will be
generated by
using this option."
But today, after reading the code and performing some experiments, I know
it is not true. It works the way you described it. I will post an email
later today with my findings to clarify this issue. So -refetchonly is
currently not working as expected: it also generates new urls (even though
they are not fetched later).
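
To make this concrete, here is a rough Java sketch of the difference
between the documented and the observed behavior. This is not the actual
Nutch generate code; the PageEntry class and its fields are invented for
illustration. The point is that a filter which treats never-fetched pages
as due will emit newly discovered urls alongside the expired ones:

import java.util.ArrayList;
import java.util.List;

public class RefetchOnlySketch {

    // Minimal stand-in for a web db entry; names are invented.
    static class PageEntry {
        String url;
        long nextFetchTime;   // epoch millis when the page is due again
        boolean everFetched;  // false for newly discovered links

        PageEntry(String url, long nextFetchTime, boolean everFetched) {
            this.url = url;
            this.nextFetchTime = nextFetchTime;
            this.everFetched = everFetched;
        }
    }

    // Behavior described in the quoted email: only pages whose
    // refetch interval has expired make it into the fetchlist.
    static List<PageEntry> strictRefetchOnly(List<PageEntry> db, long now) {
        List<PageEntry> fetchlist = new ArrayList<>();
        for (PageEntry p : db) {
            if (p.everFetched && p.nextFetchTime <= now) {
                fetchlist.add(p);
            }
        }
        return fetchlist;
    }

    // Behavior observed in my experiments: never-fetched pages count
    // as due by default, so new urls show up in the segment too.
    static List<PageEntry> observedBehavior(List<PageEntry> db, long now) {
        List<PageEntry> fetchlist = new ArrayList<>();
        for (PageEntry p : db) {
            if (!p.everFetched || p.nextFetchTime <= now) {
                fetchlist.add(p);
            }
        }
        return fetchlist;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        List<PageEntry> db = List.of(
            new PageEntry("http://a.example/", now - 1000, true),  // due
            new PageEntry("http://b.example/", now + 1000, true),  // not due
            new PageEntry("http://c.example/", 0, false));         // new link
        System.out.println("strict:   " + strictRefetchOnly(db, now).size()); // 1
        System.out.println("observed: " + observedBehavior(db, now).size());  // 2
    }
}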
Regards,
Piotr
carmmello wrote:
Hi,
I have about 300 sites on a specific subject to start with, and I have
used both the crawl method and the whole-web method. Once, for testing
purposes, I crawled those sites to depth 2 with an expiry time of just 1
day (I set this in the site.xml file) and got about 3,000 sites. After
that 1 day I used the command "bin/nutch generate db segments" with the
only flag "-refetchonly". When I did a fetch of the generated segment, I
got about 30,000 sites. If, besides -refetchonly, I had used -topN 3000,
for instance, I would get different sites, not the original ones. So I
really don't know how, beginning with an initial set of fetched or crawled
sites, to just perform maintenance on them, adding only modified or new
sites to the ones that I already have.
Thanks
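
Regarding the -topN observation above: that is consistent with the top-N
cut being applied to the whole candidate list (expired pages plus newly
discovered links), not just the original sites. A rough sketch of that
effect, again with invented names and not the real Nutch code:

import java.util.Comparator;
import java.util.List;

public class TopNSketch {
    // Hypothetical candidate entry; the score field is invented.
    record Candidate(String url, float score) {}

    // Keep only the n highest-scoring candidates.
    static List<Candidate> topN(List<Candidate> candidates, int n) {
        return candidates.stream()
            .sorted(Comparator.comparingDouble(Candidate::score).reversed())
            .limit(n)
            .toList();
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
            new Candidate("http://seed.example/", 0.2f),   // original site
            new Candidate("http://new-1.example/", 0.9f),  // new link
            new Candidate("http://new-2.example/", 0.8f)); // new link
        // With n = 2, both slots go to newly discovered urls, so the
        // generated segment no longer contains the original site.
        topN(candidates, 2).forEach(c -> System.out.println(c.url()));
    }
}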