Hello all,

I am trying to modify the RegexUrlFilter and/or the NutchConfig so that I can dynamically apply a set of domain names to the fetcher.

In the FAQ:


>> Is it possible to fetch only pages from some specific domains?

>> Please have a look on PrefixURLFilter. Adding some regular expressions to the urlfilter.regex.file might work, but adding a list with thousands of regular expressions would slow down your system excessively.


I want to provide a list of URLs to be fetched, and I want the fetcher to fetch only from those sites (not follow any links out of them). I would like to keep adding to this list without having to modify nutch-config.xml each time, and instead just add entries to the config (or some other object) in memory. All I am after is a pointer in the right direction. If I am looking in the wrong files (or am off my rocker!), please let me know where I should look instead.
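
To make the idea concrete, here is a rough sketch of the kind of in-memory host list I have in mind. It is plain Java and not wired into Nutch at all: the class name HostAllowList and its methods are made up for illustration. The "return the URL to accept it, null to reject it" convention is my assumption about how a URL filter would report its decision.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Nutch: an in-memory allow-list of hosts.
class HostAllowList {
    // Concurrent set so hosts can be added while other threads are filtering.
    private final Set<String> hosts = ConcurrentHashMap.newKeySet();

    // Add a host (e.g. "example.com") at runtime, with no config file edit.
    void addHost(String host) {
        hosts.add(host.toLowerCase(Locale.ROOT));
    }

    // Returns the URL unchanged if its host is on the list, otherwise null.
    String filter(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return null; // no recognizable host, so reject
            }
            return hosts.contains(host.toLowerCase(Locale.ROOT)) ? url : null;
        } catch (URISyntaxException e) {
            return null; // malformed URL, so reject
        }
    }
}
```

This matches hosts exactly, so "www.example.com" would need its own entry next to "example.com". A set lookup per URL should also avoid the cost of scanning thousands of regular expressions that the FAQ warns about.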

The reason I am asking is that I am working on a "roll your own search". I want to crawl specific sites only and then, at search time, return results from only some subset of those crawled sites.

Best regards,

Kristan Uccello



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
