Hello everyone,

I currently have Nutch set up doing a "whole-web" style crawl. When I need to index a new page, or re-index an existing page, immediately, I start a process that waits until the webdb is not being used by the normal crawl process, locks the webdb using the existence of a file as a mutex, and performs a modified inject [which uses WebDBWriter's addPageWithScore() instead of addPageIfNotPresent()], followed by a generate, fetch, updatedb, analyze, index, and deletion of duplicates. The inject score is set to a very high value, and I then specify that value as the cutoff for the generate operation. This way I can very quickly add new or refreshed pages to my searchable content by hand.
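For anyone curious about the file-as-mutex part, here is a minimal sketch of the idea. WebDbLock and the lock-file name are hypothetical, not Nutch classes; the point is that File.createNewFile() atomically creates the file only if it does not already exist, which makes it usable as a crude cross-process lock:

```java
import java.io.File;
import java.io.IOException;

// Sketch only: uses the existence of a file as a mutex around webdb access.
public class WebDbLock {
    private final File lockFile;

    public WebDbLock(String path) {
        this.lockFile = new File(path);
    }

    // Poll until the lock file can be created atomically, then hold the lock.
    public void acquire(long pollMillis) throws IOException, InterruptedException {
        while (!lockFile.createNewFile()) {
            Thread.sleep(pollMillis); // webdb busy; wait and retry
        }
    }

    public void release() {
        lockFile.delete();
    }

    public static void main(String[] args) throws Exception {
        WebDbLock lock = new WebDbLock("webdb.lock");
        lock.acquire(100);
        try {
            System.out.println("holding lock: " + new File("webdb.lock").exists());
            // ... run inject / generate / fetch / updatedb / analyze / index here ...
        } finally {
            lock.release();
        }
        System.out.println("released: " + !new File("webdb.lock").exists());
    }
}
```

One caveat of this scheme is that a crashed process leaves the lock file behind, so the held-too-long case needs handling out of band.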
What I would like to do next is the same thing for an entire domain, given only the root web page (being able to re-index everything that matches a given regular expression would be even better). So basically, I would like to inject a file of root pages, and then crawl from each of these root pages until I have all of the domain's content refreshed. This would of course only be used for small to mid-sized domains, under 100 pages or so.

The closest I have been able to get to this goal so far is:

1. Inject a URL with a high score, as above.
2. Generate a fetchlist with that high score as the cutoff.
3. Update the webdb, skipping the analyze step, so that outlinks gathered from entirely new content also keep the high score assigned to injected URLs.
4. Generate with the cutoff again.

... and repeat this to a prespecified depth. There are two downsides to this: it relies on a prespecified depth, and if an injected page already exists (meaning I would like the domain refreshed instead), the high score and next fetch date are not propagated to the outlinks of the injected URL (which is, of course, the desired behavior during a normal crawl).

The best thing I can think of now is to write my own external UpdateDatabaseTool that propagates the score and next fetch date to outlinks unconditionally, and to use this tool only for such priority cases. Does anyone know of a better way to approach the implementation of such functionality?

Also, I think that in my scenario using something like Google's Sitemaps (https://www.google.com/webmasters/sitemaps/docs/en/protocol.html) to help direct the crawl would be helpful. Are there any plans to incorporate something of the sort into Nutch sometime in the near future?
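To make the proposed behavior concrete, here is a toy in-memory sketch of the unconditional propagation such a custom UpdateDatabaseTool would perform. PageRecord, propagate(), and the field names are all illustrative assumptions, not Nutch APIs; the only difference from the stock behavior is that the injected score and next-fetch date overwrite outlink pages even when they already exist in the db:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a "priority" db update: copies the parent's high score and
// next-fetch date onto every outlink unconditionally, so existing pages
// are picked up again by the next high-cutoff generate.
public class PriorityUpdateSketch {
    static class PageRecord {
        float score;
        long nextFetch; // epoch millis

        PageRecord(float score, long nextFetch) {
            this.score = score;
            this.nextFetch = nextFetch;
        }
    }

    // db: url -> record; parent: the injected/fetched page; outlinks: urls found on it
    static void propagate(Map<String, PageRecord> db, PageRecord parent, String[] outlinks) {
        for (String url : outlinks) {
            PageRecord existing = db.get(url);
            if (existing == null) {
                // new page: same as the normal update path
                db.put(url, new PageRecord(parent.score, parent.nextFetch));
            } else {
                // existing page: overwrite unconditionally (the proposed change)
                existing.score = parent.score;
                existing.nextFetch = parent.nextFetch;
            }
        }
    }

    public static void main(String[] args) {
        Map<String, PageRecord> db = new HashMap<String, PageRecord>();
        db.put("http://example.com/old", new PageRecord(1.0f, 9999999L));

        PageRecord injected = new PageRecord(1000000f, 0L); // high score, fetch now
        propagate(db, injected, new String[] {
            "http://example.com/old", "http://example.com/new"
        });

        System.out.println(db.get("http://example.com/old").score); // overwritten
        System.out.println(db.size());
    }
}
```

Running this loop to a fixed depth over the freshly updated db would then re-fetch the whole domain, which is essentially the manual procedure described above minus the depth guesswork for existing pages.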
Thanks in advance,
Kamil
