Matthias Jaekle wrote:
In this case, do we refetch everything monthly? Why not refetch only the changed pages (checking the last-modified date and that the page does not return a 404 error)?

I think Nutch is not able to do this at the moment.

If I fetch topN 500,000 daily -> 500,000 * 30 = 15 million pages in the db only?

15 million pages + the number of known links from these pages = the number of URLs in the db.


Yes. If you want to keep more documents in your index, increase the number of days between refetches.
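The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation discussed in this thread: the function and parameter names are hypothetical and are not Nutch settings or APIs.

```python
def pages_kept_fresh(top_n_per_day: int, refetch_interval_days: int) -> int:
    """Pages that can be fetched once per refetch interval.

    Assumes a fixed daily topN and that every page is refetched
    exactly once per interval (hypothetical simplification).
    """
    return top_n_per_day * refetch_interval_days


def days_needed(target_pages: int, top_n_per_day: int) -> int:
    """Refetch interval (in days) needed to cover target_pages, rounded up."""
    return -(-target_pages // top_n_per_day)  # ceiling division


if __name__ == "__main__":
    # 500,000 pages per day over a 30-day interval
    print(pages_kept_fresh(500_000, 30))     # 15000000
    # interval needed to keep 20 million pages in the index
    print(days_needed(20_000_000, 500_000))  # 40
```

Under these assumptions, keeping 20 million fetched pages at 500,000 per day would mean raising the refetch interval from 30 to 40 days.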

Does dedup remove entries only from the segment index, or from the segments too?

I am not sure.

Dedup removes only index entries; duplicate content is left in the segment data, because it would be too costly to remove it. However, duplicate content is removed if you run the SegmentMergeTool.



--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com


