Hello all,
I am attempting to modify the RegexUrlFilter and/or the NutchConfig so
that I can dynamically apply a set of domain names to the fetcher.
In the FAQ:
>>Is it possible to fetch only pages from some specific domains?
>>Please have a look on PrefixURLFilter. Adding some regular
>>expressions to the urlfilter.regex.file might work, but adding a list
>>with thousands of regular expressions would slow down your system
>>excessively.
I wish to be able to provide a list of URLs that I want to have
fetched, and I want the fetcher to fetch only from those sites (not
follow any links out of those sites). I would like to be able to keep
adding to this list without having to modify the nutch-config.xml each
time, and instead just add entries to the config (or other object) in
memory. All I am after is a point in the right direction. If I am
looking in the wrong files (or off my rocker!), please let me know
where I could/should go.
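
To make the idea concrete, here is a rough sketch of the kind of
in-memory filter I have in mind. It is plain Java and does not use any
Nutch classes; the class name DomainWhitelistFilter and the methods
addHost and filter are made up for illustration. I am assuming a filter
contract where an accepted URL is returned unchanged and a rejected URL
yields null, and that exact host matching is good enough (so
www.example.com and example.com count as different hosts).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Nutch: keeps the allowed hosts in memory
// so that new sites can be added at runtime without editing a config file.
class DomainWhitelistFilter {
    // Concurrent set, so hosts can be added while other threads are filtering.
    private final Set<String> allowedHosts = ConcurrentHashMap.newKeySet();

    // Adds a host name such as "example.com" (stored in lower case).
    void addHost(String host) {
        allowedHosts.add(host.toLowerCase(Locale.ROOT));
    }

    // Returns the URL unchanged if its host is on the list, otherwise null.
    // Assumed contract: null means "do not fetch this URL".
    String filter(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return null; // no host part, e.g. a relative URL
            }
            return allowedHosts.contains(host.toLowerCase(Locale.ROOT)) ? url : null;
        } catch (URISyntaxException e) {
            return null; // unparsable URLs are rejected
        }
    }
}
```

Each check is a single hash lookup rather than a scan over thousands of
regular expressions, which is the slowdown the FAQ warns about. What I
do not know is where something like this should be hooked into the
fetcher.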
The reason I am asking this is that I am working on a "roll your own
search". I want to be able to crawl specific sites only, and then
restrict the search results to some subset of those crawled sites.
Best regards,
Kristan Uccello