Thanks for the help, and for the good answers.
Matthias Jaekle wrote:
In this case, do we refetch everything monthly? Why is it not enough to refetch only the changed pages (check the last-modified date and that there is no 404 error)?
I think Nutch is not able to do this at the moment.
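For illustration only, and not something Nutch does: at the HTTP level, "refetch only changed pages" could be sketched as a conditional GET that sends an If-Modified-Since header, where a 304 answer means the page is unchanged. The function names and the status labels below are hypothetical.

```python
import urllib.request
from email.utils import formatdate


def build_conditional_request(url, last_fetch_epoch):
    """Build a GET request asking the server to send the page only if it
    changed since last_fetch_epoch (seconds since the Unix epoch)."""
    req = urllib.request.Request(url)
    # HTTP dates must be in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    req.add_header("If-Modified-Since", formatdate(last_fetch_epoch, usegmt=True))
    return req


def classify_status(status):
    """Map an HTTP status code to a (hypothetical) refetch decision."""
    if status == 304:
        return "unchanged"  # keep the stored copy
    if status == 404:
        return "gone"       # drop the page
    if 200 <= status < 300:
        return "changed"    # store the new content
    return "retry"          # anything else: try again later
```

Not every server answers with 304, so a crawler built this way would still need a full refetch as a fallback.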
If I can fetch topN 500,000 daily -> 500,000 * 30 = 15 million pages in the db only?
15 million pages + the number of known links from these pages = the number of URLs in the db.
Yes. If you want to keep more documents in your index, increase the number of days for refetching.
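The arithmetic above can be written as a small sketch. The function name is made up for illustration, and it assumes every daily fetch brings in topN new pages:

```python
def pages_per_refetch_cycle(top_n_per_day, refetch_interval_days):
    """Pages that can be fetched once before the first ones are due again."""
    return top_n_per_day * refetch_interval_days


print(pages_per_refetch_cycle(500_000, 30))  # 15000000
print(pages_per_refetch_cycle(500_000, 60))  # 30000000
```

Doubling the refetch interval doubles the number of pages that fit in one cycle, at the cost of older copies in the index.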
Does dedup only remove duplicates from the segment index, or does it remove them from the segments too?
I am not sure.
Matthias
