On Wed, 23 Mar 2005 09:32:59 -0800, Doug Cutting wrote:
> Nutch is certainly not meant to only be a whole-web-crawling
> application.

That's actually news to me.

> A URL filter can be a plugin, so why not submit your
> patch as a plugin that's disabled for most folks.  Then folks who
> want the functionality you describe can simply specify your URL
> filter plugin. You can supply sample config files & documentation
> with the plugin. Does that sound workable?

I would certainly have done so if URL filtering were all I needed. But what I 
need is to configure which links and pages get added to the respective dbs 
(i.e. the crawling strategy), rather than just URL filtering. It can be thought 
of as a generalization of both URL filtering and the code in UpdateDatabaseTool 
that uses IGNORE_INTERNAL_LINKS to decide on page/link addition.

For that, I need at least
1. URL of the fetched page
2. URL of the link in question

whereas URLFilter only provides the second.
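To make the idea concrete, here is a minimal sketch of what such an extended filter interface could look like. This is a hypothetical API, not anything in Nutch today: the interface name CrawlScopeFilter and the SameHostFilter example are my own inventions; the only part mirrored from Nutch is URLFilter's convention of returning the URL to keep it or null to drop it.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ScopeFilterSketch {
    /** Hypothetical analogue of URLFilter.filter(String), but with the
     *  fetched page's URL available as well as the candidate link.
     *  Returns the link to keep it, or null to drop it. */
    interface CrawlScopeFilter {
        String filter(String pageUrl, String linkUrl);
    }

    /** Example implementation: keep only links on the same host as the
     *  page they were found on. */
    static class SameHostFilter implements CrawlScopeFilter {
        public String filter(String pageUrl, String linkUrl) {
            try {
                String pageHost = new URL(pageUrl).getHost();
                String linkHost = new URL(linkUrl).getHost();
                return pageHost.equalsIgnoreCase(linkHost) ? linkUrl : null;
            } catch (MalformedURLException e) {
                return null; // drop unparseable URLs
            }
        }
    }

    public static void main(String[] args) {
        CrawlScopeFilter f = new SameHostFilter();
        System.out.println(f.filter("http://example.com/a", "http://example.com/b")); // kept
        System.out.println(f.filter("http://example.com/a", "http://other.org/c"));   // null
    }
}
```

A single-argument URLFilter cannot express the same-host rule at all, since it never sees where the link came from; that is the whole point of passing both URLs.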

In concrete terms, I'd like to support the following scenarios:
1. only URLs with the same TLD as the seed URL are added to the page db and link db
2. same as 1, except external links are also added to the link db
3. only crawl external links from the seed URL
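Scenario 1 above reduces to a simple comparison once both URLs are available. A rough sketch of that check, under the assumption that "TLD" means the last label of the hostname (a real implementation would want a public-suffix list to handle cases like "co.uk"); the class and method names here are illustrative only:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class TldScope {
    /** Naive "TLD" extraction: the last dot-separated label of the host,
     *  e.g. "de" for "www.example.de". Does not handle multi-label
     *  suffixes like "co.uk". */
    static String tld(String url) throws MalformedURLException {
        String host = new URL(url).getHost();
        int dot = host.lastIndexOf('.');
        return dot >= 0 ? host.substring(dot + 1) : host;
    }

    /** Scenario 1: accept a link only if it shares the seed URL's TLD. */
    static boolean inScope(String seedUrl, String linkUrl) {
        try {
            return tld(seedUrl).equalsIgnoreCase(tld(linkUrl));
        } catch (MalformedURLException e) {
            return false; // reject anything unparseable
        }
    }

    public static void main(String[] args) {
        System.out.println(inScope("http://seed.de/", "http://other.de/page"));  // true
        System.out.println(inScope("http://seed.de/", "http://other.com/page")); // false
    }
}
```

Scenarios 2 and 3 only differ in what is done with the accepted/rejected links (add to link db only, or invert the test), so the same two-URL hook would cover all three.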

Additionally, perhaps the entire pageContentsChanged() in UpdateDatabaseTool is 
a candidate for converting into a plugin, accepting FetcherOutput, ParseData 
and a WebDB as parameters.

I'm happy to supply patches if you can give me feedback on an acceptable 
approach.



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers