OK, thanks for the answers. A simpler question: how do you use Nutch on a live db
(how to update pages, delete old data, etc.)?
Andrzej Bialecki wrote:
Matthias Jaekle wrote:
In this case, do we refetch everything monthly? Why is it not enough to
refetch only changed pages (check the last-modified date and that the page
does not return a 404 error)?
I think Nutch is not able to do this at the moment.
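
For reference, the "refetch only changed pages" idea is what an HTTP conditional GET does. Below is a minimal sketch in Python, outside of Nutch and not part of its API. The function names (build_conditional_request, classify_status, check_page) are made up for illustration, and it assumes the server honours the If-Modified-Since header, which not all servers do.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def build_conditional_request(url, last_fetch_epoch):
    """Build a GET request asking the server to skip an unchanged page."""
    req = urllib.request.Request(url)
    # If-Modified-Since carries an HTTP-date, which is always in GMT.
    req.add_header("If-Modified-Since",
                   formatdate(last_fetch_epoch, usegmt=True))
    return req


def classify_status(status):
    """Map an HTTP status code to a (hypothetical) refetch decision."""
    if status == 304:
        return "unchanged"   # keep the stored copy
    if status in (404, 410):
        return "gone"        # candidate for deletion from the db
    if 200 <= status < 300:
        return "changed"     # new content was returned, re-index it
    return "retry"           # anything else: try again later


def check_page(url, last_fetch_epoch):
    """Perform the conditional GET and classify the result."""
    req = build_conditional_request(url, last_fetch_epoch)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return classify_status(resp.status)
    except urllib.error.HTTPError as err:
        # urllib reports 304 and 4xx/5xx responses as HTTPError.
        return classify_status(err.code)
```

Calling check_page(url, timestamp_of_last_fetch) would then return one of the four labels; only "changed" pages need to be parsed and indexed again.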
So if I fetch topN 500,000 daily -> 500,000 * 30 = 15 million pages in the db only?
15 million pages + the number of known links from these pages = the number of
URLs in the db.
Yes. If you want to keep more documents in your index, increase the
number of days between refetches.
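
The sizing rule in this exchange is simple arithmetic: pages kept fresh = pages fetched per day * days in the refetch interval. A small sketch in Python, with made-up function names and assuming the same number of pages is fetched every day:

```python
import math


def steady_state_pages(top_n_per_day, refetch_interval_days):
    """Pages that can be kept fresh when each one is refetched once per interval."""
    return top_n_per_day * refetch_interval_days


def days_needed(target_pages, top_n_per_day):
    """Refetch interval (in whole days) needed to cover target_pages."""
    return math.ceil(target_pages / top_n_per_day)


print(steady_state_pages(500_000, 30))   # 15000000
print(days_needed(30_000_000, 500_000))  # 60
```

So doubling the index to 30 million pages at the same daily topN means a 60-day refetch interval. The count of known-but-unfetched URLs in the db comes on top of this and is not modelled here.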
Does dedup remove entries only from the segment index, or from the segments too?
I am not sure.
Dedup removes only index entries; duplicate content is left in the
segment data, because it would be too costly to remove it. However, duplicate
content is removed if you run the SegmentMergeTool.
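
To illustrate the difference, here is a toy model in Python. It is not Nutch code: the data layout (a segment as a list of (url, content) pairs), the function names, the use of an MD5 content hash, and the "keep the first copy seen" rule are all assumptions made for the example.

```python
import hashlib


def content_hash(text):
    """Fingerprint used to detect identical content."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def dedup_index(segments):
    """Index-level dedup: return the URLs that stay searchable.

    The segments themselves are not modified, so duplicate
    content still occupies space in the segment data.
    """
    seen = set()
    live_urls = []
    for segment in segments:
        for url, content in segment:
            digest = content_hash(content)
            if digest not in seen:
                seen.add(digest)
                live_urls.append(url)
    return live_urls


def merge_segments(segments):
    """Merge-style rewrite: copy records into one new segment,
    dropping records whose content was already copied."""
    seen = set()
    merged = []
    for segment in segments:
        for url, content in segment:
            digest = content_hash(content)
            if digest in seen:
                continue
            seen.add(digest)
            merged.append((url, content))
    return merged


segments = [[("a", "x"), ("b", "x")], [("c", "y"), ("d", "x")]]
print(dedup_index(segments))     # ['a', 'c']
print(len(segments[0]))          # 2 (segment data untouched)
print(merge_segments(segments))  # [('a', 'x'), ('c', 'y')]
```

The first function is cheap because it only decides which entries are visible; the second has to rewrite all the data, which is why it is a separate step.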