Based on about one day of reading and experimenting, here's what I think needs to happen:
1. Continuously run a fetch-based crawl in the "whole web" style from the tutorial, or run a "nutch crawl -depth 10" once a month.
2. Perform a more frequent "nutch crawl -depth 2" on the site(s) I want fresh data for.
3. Figure out how to merge and/or search both together.
Each page in the database has a separate refetch interval. Currently nothing ever changes this from the default, but it would not be hard to write a tool that, e.g., decreased the fetch interval for high-scoring pages from 30 days to one week. That's the intended way to handle this sort of thing.
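To make the idea concrete, here is a rough sketch of what such a tool might do. It is not Nutch code: the Page record, the shorten_intervals function, and the score threshold are all hypothetical stand-ins, and only the 30-day and one-week figures come from the text above.

```python
from dataclasses import dataclass

DEFAULT_INTERVAL_DAYS = 30  # default refetch interval mentioned above
FRESH_INTERVAL_DAYS = 7     # "one week" for high-scoring pages


@dataclass
class Page:
    """Hypothetical stand-in for a page database entry (not Nutch's own type)."""
    url: str
    score: float
    fetch_interval_days: int = DEFAULT_INTERVAL_DAYS


def shorten_intervals(pages, score_threshold):
    """Lower the refetch interval of every page scoring at or above the threshold.

    Returns the number of pages whose interval was changed.
    """
    changed = 0
    for page in pages:
        if (page.score >= score_threshold
                and page.fetch_interval_days > FRESH_INTERVAL_DAYS):
            page.fetch_interval_days = FRESH_INTERVAL_DAYS
            changed += 1
    return changed
```

A real tool would read and write the page database rather than an in-memory list, and the threshold would need tuning against the actual score distribution.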
Question 2:
Is there a way to limit a crawl to a certain number of pages of a given site or crawl, in addition to setting the depth? For example, limit the crawl to 10 levels deep, or 10k documents, whichever it hits first.
That's not currently supported, but it wouldn't be too hard to add...
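For illustration only, the "whichever it hits first" rule could look something like the sketch below. This is a generic breadth-first crawl loop, not how Nutch is implemented: bounded_crawl, get_links, max_depth, and max_pages are hypothetical names, and the seeds are assumed to count as level 1.

```python
from collections import deque


def bounded_crawl(seeds, get_links, max_depth, max_pages):
    """Breadth-first crawl that stops at max_depth levels or max_pages pages.

    get_links is a caller-supplied function mapping a URL to its outlinks
    (hypothetical; a real crawler would fetch and parse the page here).
    Returns the URLs in the order they were fetched.
    """
    if max_depth < 1:
        return []

    seen = set()
    queue = deque()
    for url in seeds:
        if url not in seen:
            seen.add(url)
            queue.append((url, 1))  # seeds are level 1

    fetched = []
    # The page limit is checked before every fetch, so it can end the crawl
    # in the middle of a level.
    while queue and len(fetched) < max_pages:
        url, level = queue.popleft()
        fetched.append(url)
        # The depth limit only stops outlinks from being queued.
        if level < max_depth:
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, level + 1))
    return fetched
```

Called with max_depth=10 and max_pages=10000, this would match the example in the question: the crawl ends when either limit is reached.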
Doug
