On Wed, 23 Mar 2005 09:32:59 -0800, Doug Cutting wrote:
> Nutch is certainly not meant to only be a whole-web-crawling
> application.
That's actually news to me.

> A URL filter can be a plugin, so why not submit your patch as a
> plugin that's disabled for most folks. Then folks who want the
> functionality you describe can simply specify your URL filter
> plugin. You can supply sample config files & documentation with
> the plugin. Does that sound workable?

I would certainly have done so if that were possible. But what I need is to
configure which links and pages get added to the respective dbs (i.e. the
crawling strategy), rather than just URL filtering. I suppose it can be
thought of as a generalization of both URL filtering and of the code in
UpdateDatabaseTool that uses IGNORE_INTERNAL_LINKS to decide on page/link
addition. For that, I need at least:

1. the URL of the fetched page
2. the URL of the link in question

whereas URLFilter only provides the second.

In concrete terms, I'd like to support the following scenarios:

1. only URLs with the same TLD as the seed URL are added to the page db and linkdb
2. same as 1, except external links are also added to the linkdb
3. only external links from the seed URL are crawled

Additionally, perhaps the entire pageContentsChanged() method in
UpdateDatabaseTool is a candidate for conversion into a plugin, accepting
FetcherOutput, ParseData and a WebDB as parameters.

I'm happy to supply patches if you can give me feedback on an acceptable
approach.

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
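To make the proposal above concrete, here is a minimal sketch of the kind of two-argument filter interface being asked for. The names `CrawlStrategyFilter` and `SameHostFilter` are hypothetical, not part of Nutch; the point is only that, unlike `URLFilter`, the filter sees both the fetched page's URL and the candidate link, so strategies like scenario 1 become possible:

```java
import java.net.URI;

// Hypothetical extension point (not in Nutch): unlike URLFilter,
// accept() receives the URL of the fetched page as well as the
// URL of the link in question.
interface CrawlStrategyFilter {
    // Return true if the link should be added to the page/link dbs.
    boolean accept(String fromUrl, String linkUrl);
}

// Illustration of scenario 1, simplified to a same-host comparison
// (the same place where a TLD comparison would go): keep only links
// whose host matches the host of the page they were found on.
class SameHostFilter implements CrawlStrategyFilter {
    public boolean accept(String fromUrl, String linkUrl) {
        try {
            String fromHost = new URI(fromUrl).getHost();
            String linkHost = new URI(linkUrl).getHost();
            return fromHost != null && fromHost.equalsIgnoreCase(linkHost);
        } catch (Exception e) {
            return false; // reject malformed URLs
        }
    }
}

public class Demo {
    public static void main(String[] args) {
        CrawlStrategyFilter f = new SameHostFilter();
        // internal link: kept
        System.out.println(f.accept("http://example.org/a", "http://example.org/b"));
        // external link: dropped under this strategy
        System.out.println(f.accept("http://example.org/a", "http://other.com/b"));
    }
}
```

Scenarios 2 and 3 would be further implementations of the same interface, differing only in how they treat the external-link case for each db.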
