Hello all,

I am attempting to modify the RegexUrlFilter and/or the NutchConfig so that I can dynamically apply a set of domain names to the fetcher.

In the FAQ:

>>Is it possible to fetch only pages from some specific domains?

>>Please have a look on PrefixURLFilter. Adding some regular expressions to the urlfilter.regex.file might work, but adding a list with thousands of regular expressions would slow down your system excessively.


I want to provide a list of URLs to be fetched, and I want the fetcher to fetch only from those sites (not follow any links out of them). I would like to keep adding to this list without modifying nutch-config.xml each time, instead adding entries to the config (or some other object) in memory. All I am after is a pointer in the right direction. If I am looking in the wrong files (or am off my rocker!), please let me know where I should look instead.
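
To make the idea concrete, here is a rough sketch of the kind of in-memory allow-list I have in mind. It is plain Java and does not use any Nutch classes: the class name InMemoryDomainFilter and its allow() method are made up for illustration, and the filter() method only imitates what I understand to be the URL filter convention of returning the URL to keep it or null to drop it. How (or whether) something like this can be wired into the fetcher is exactly what I am unsure about.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Nutch: a host allow-list that can grow at runtime.
final class InMemoryDomainFilter {
    // Concurrent set so hosts can be added while other threads are filtering.
    private final Set<String> allowedHosts = ConcurrentHashMap.newKeySet();

    // Add a host (e.g. "example.com") without touching any config file.
    void allow(String host) {
        allowedHosts.add(host.toLowerCase(Locale.ROOT));
    }

    // Returns the URL unchanged if its host is on the list, otherwise null.
    String filter(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return null; // no host component, so nothing to match against
            }
            return allowedHosts.contains(host.toLowerCase(Locale.ROOT)) ? url : null;
        } catch (URISyntaxException e) {
            return null; // unparseable URLs are rejected
        }
    }
}
```

Note that this matches hosts exactly, so "www.example.com" and "example.com" would need separate entries. A hash lookup like this should also avoid the slowdown the FAQ warns about for thousands of regular expressions.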

The reason I am asking is that I am working on a "roll your own search". I want to be able to crawl specific sites only and then, at search time, restrict the results to some subset of those crawled sites.

Best regards,

Kristan Uccello
