Hi,
I have about 300 sites on a specific subject to start with, and I have
tried both the crawl method and the whole-internet method. For testing
purposes, I crawled those sites to depth 2 with an expiry time of just
1 day (I set this in the site.xml file) and got about 3,000 pages. After
that 1 day I ran "bin/nutch generate db segments" with only the
"-refetchonly" flag. When I fetched the generated segment, I got about
30,000 pages. If, besides -refetchonly, I used -topN 3000, for instance,
I would get different sites, not the original ones. So I really don't
know how, beginning with an initial set of fetched or crawled sites, to
perform maintenance on them, adding only modified or new pages to the
ones I already have.
Thanks
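For reference, the cycle I am attempting looks roughly like this (flag
names are the ones I used; exact command syntax may differ between Nutch
versions, so treat this as a sketch rather than verified syntax):

```shell
# Initial crawl of the ~300 seed sites to depth 2 (urls/ holds the seed list):
bin/nutch crawl urls -dir crawl -depth 2

# After the 1-day expiry, generate a segment that should contain only
# pages due for refetching:
bin/nutch generate db segments -refetchonly

# Fetch the newest generated segment and fold the results back into the db:
s=`ls -d segments/2* | tail -1`
bin/nutch fetch $s
bin/nutch updatedb db $s
```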
