In this case, do we refetch everything monthly? Why not refetch only the changed pages (check the last-modified date and that the page does not return a 404 error)?
I think Nutch is not able to do this at the moment.
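
For illustration only, the idea in the question (refetch only what changed) could be sketched with a conditional HTTP GET. This is not how Nutch behaves; the function names `conditional_headers` and `should_reindex` are hypothetical, and the sketch assumes the crawler stores a timezone-aware time of the last fetch for every URL.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict, Optional


def conditional_headers(last_fetch: datetime) -> Dict[str, str]:
    """Build the request header for a conditional GET.

    last_fetch must be timezone-aware (assumption of this sketch).
    """
    as_utc = last_fetch.astimezone(timezone.utc)
    return {"If-Modified-Since": format_datetime(as_utc, usegmt=True)}


def should_reindex(status: int, last_modified: Optional[str],
                   last_fetch: datetime) -> bool:
    """Decide whether a response carries content worth re-indexing."""
    if status != 200:
        # 304 Not Modified, 404 Not Found and other non-200 answers
        # bring no new page content.
        return False
    if last_modified is None:
        # The server gave no date, so assume the page may have changed.
        return True
    try:
        modified = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        # Unparseable date header: be conservative and re-index.
        return True
    if modified.tzinfo is None:
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > last_fetch
```

A server that honours `If-Modified-Since` answers 304 without a body for unchanged pages, so the saving is mostly bandwidth and parsing time; the request itself is still made for every URL.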

I can fetch topN 500,000 daily -> 500,000 * 30 = 15 million pages in the db only?
15 million pages + the number of known links from these pages = the number of URLs in the db.

Yes. If you want to keep more documents in your index, increase the number of days between refetches.
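
The arithmetic above can be written down as a small sketch. The function names are made up for this mail, and it assumes every daily run really fetches the full topN and that each page is refetched exactly once per interval.

```python
def fetched_pages(top_n_per_day: int, refetch_interval_days: int) -> int:
    """Pages that can be kept fresh if each is refetched once per interval."""
    return top_n_per_day * refetch_interval_days


def interval_needed(target_pages: int, top_n_per_day: int) -> int:
    """Smallest refetch interval (in days) that covers target_pages."""
    # Ceiling division without floating point.
    return -(-target_pages // top_n_per_day)


print(fetched_pages(500_000, 30))            # 15000000
print(interval_needed(20_000_000, 500_000))  # 40
```

So with topN 500,000 a 30-day interval covers 15 million fetched pages, and keeping 20 million fetched pages would need a 40-day interval. The db itself is larger, since it also holds the known but not yet fetched links.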

Does dedup only remove entries from the segment index, or does it remove them from the segments too?
I am not sure.

Matthias

