I've been pondering the appropriateness of Nutch for website mirroring (and subsequent searching), basically Teleport Pro-like functionality.
I've already patched Nutch to do this by including hard-coded rules, such as only adding a link from a page if it's within the same domain. The current URL filtering mechanism can be extended to support more flexible URL filtering (like domain-only or host-only), but this doesn't belong in a whole-web crawling application.
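To make the domain-only and host-only rules concrete, here is a minimal, language-neutral sketch in Python. It is not the patch described above and it does not use any Nutch API; `host_of`, `naive_domain`, and `keep_link` are hypothetical names chosen for illustration, and the "last two labels" domain heuristic is an assumption that fails for suffixes like `co.uk`.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host name of a URL, or "" if it has none."""
    return (urlsplit(url).hostname or "").lower()


def naive_domain(host):
    """Approximate the registered domain as the last two labels of the host.

    Assumption: this is wrong for suffixes such as "co.uk"; a real filter
    would consult a public-suffix list instead.
    """
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def keep_link(page_url, link_url, scope="domain"):
    """Decide whether a link found on page_url should be followed.

    scope="host"   keeps the link only if both URLs share the exact host.
    scope="domain" keeps it if both hosts share the same (naive) domain.
    """
    src, dst = host_of(page_url), host_of(link_url)
    if not src or not dst:
        # Links without a host (mailto:, javascript:, ...) are dropped.
        return False
    if scope == "host":
        return src == dst
    if scope == "domain":
        return naive_domain(src) == naive_domain(dst)
    raise ValueError("scope must be 'host' or 'domain'")
```

With `scope="domain"`, a link from `www.example.com` to `docs.example.com` is kept; with `scope="host"` it is rejected. A mirroring crawl would apply a check like this to each outlink before adding it to the fetch list.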
Nutch is certainly not meant to be only a whole-web crawling application. A URL filter can be a plugin, so why not submit your patch as a plugin that's disabled by default? Then folks who want the functionality you describe can simply specify your URL filter plugin. You can supply sample config files & documentation with the plugin. Does that sound workable?
Doug
_______________________________________________ Nutch-developers mailing list https://lists.sourceforge.net/lists/listinfo/nutch-developers
