Hello,
I spent some time analyzing this and I am a bit surprised by the results. I
always assumed that -refetchonly works as described in one of the emails on
this list:
"-refetchonly generates you an segment(FetchList) that only contains the
urls
that need to be refetched based on your refetch interval.
Right, new discovered links are not in the fetchlist that will be
generated by
using this option."
But today, after reading the code and performing some experiments, I know
it is not true. It works the way you described it. I will post an email
later today with my findings to clarify this issue. So -refetchonly is
currently not working as expected: it also generates new urls (even though
they are not fetched later).
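
To make this concrete, here is a rough Java sketch of the difference
between the documented and the observed behavior. This is not the actual
Nutch generate code; the PageEntry class and its fields are invented for
illustration. The point is that a filter which treats never-fetched pages
as due will emit newly discovered urls alongside the expired ones:

import java.util.ArrayList;
import java.util.List;

public class RefetchOnlySketch {

    // Minimal stand-in for a web db entry; names are invented.
    static class PageEntry {
        String url;
        long nextFetchTime;   // epoch millis when the page is due again
        boolean everFetched;  // false for newly discovered links

        PageEntry(String url, long nextFetchTime, boolean everFetched) {
            this.url = url;
            this.nextFetchTime = nextFetchTime;
            this.everFetched = everFetched;
        }
    }

    // Behavior described in the quoted email: only pages whose
    // refetch interval has expired make it into the fetchlist.
    static List<PageEntry> strictRefetchOnly(List<PageEntry> db, long now) {
        List<PageEntry> fetchlist = new ArrayList<>();
        for (PageEntry p : db) {
            if (p.everFetched && p.nextFetchTime <= now) {
                fetchlist.add(p);
            }
        }
        return fetchlist;
    }

    // Behavior observed in my experiments: never-fetched pages count
    // as due by default, so new urls show up in the segment too.
    static List<PageEntry> observedBehavior(List<PageEntry> db, long now) {
        List<PageEntry> fetchlist = new ArrayList<>();
        for (PageEntry p : db) {
            if (!p.everFetched || p.nextFetchTime <= now) {
                fetchlist.add(p);
            }
        }
        return fetchlist;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        List<PageEntry> db = List.of(
            new PageEntry("http://a.example/", now - 1000, true),  // due
            new PageEntry("http://b.example/", now + 1000, true),  // not due
            new PageEntry("http://c.example/", 0, false));         // new link
        System.out.println("strict:   " + strictRefetchOnly(db, now).size()); // 1
        System.out.println("observed: " + observedBehavior(db, now).size());  // 2
    }
}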
Regards,
Piotr
carmmello wrote:
Hi,
I have about 300 sites on a specific subject to start with, and I have
used both the crawl method and the whole-web method. Once, for testing
purposes, I crawled those sites to depth 2 with an expiry time of just 1
day (I set this in the site.xml file) and got about 3,000 sites. After
that 1 day I used the command "bin/nutch generate db segments" with the
only flag "-refetchonly". When I did a fetch of the generated segment, I
got about 30,000 sites. If, besides -refetchonly, I had used -topN 3000,
for instance, I would get different sites, not the original ones. So I
really don't know how, beginning with an initial set of fetched or crawled
sites, to just perform maintenance on them, adding only modified or new
sites to the ones that I already have.
Thanks
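
Regarding the -topN observation above: that is consistent with the top-N
cut being applied to the whole candidate list (expired pages plus newly
discovered links), not just the original sites. A rough sketch of that
effect, again with invented names and not the real Nutch code:

import java.util.Comparator;
import java.util.List;

public class TopNSketch {
    // Hypothetical candidate entry; the score field is invented.
    record Candidate(String url, float score) {}

    // Keep only the n highest-scoring candidates.
    static List<Candidate> topN(List<Candidate> candidates, int n) {
        return candidates.stream()
            .sorted(Comparator.comparingDouble(Candidate::score).reversed())
            .limit(n)
            .toList();
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
            new Candidate("http://seed.example/", 0.2f),   // original site
            new Candidate("http://new-1.example/", 0.9f),  // new link
            new Candidate("http://new-2.example/", 0.8f)); // new link
        // With n = 2, both slots go to newly discovered urls, so the
        // generated segment no longer contains the original site.
        topN(candidates, 2).forEach(c -> System.out.println(c.url()));
    }
}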