(note: this is probably more relevant to nutch-user. please send
replies there.)
This question seems to come up periodically.
Personally, I accomplish this via a custom URLFilter that uses a
MapFile of regex pattern-lists, e.g. one set of regexes per website.
You can find the code at http://issues.apache.org/jira/browse/NUTCH-87.
All this does is allow you to keep track of a large set of regexes,
partitioned by site. It's useful if you want an extremely-focused
crawl, possibly burrowing through CGIs.
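To make the idea concrete, here is a minimal, self-contained sketch
of the per-site regex lookup. This is NOT the NUTCH-87 code: the real
plugin reads its pattern lists from a MapFile and plugs into Nutch's
URLFilter extension point, and the class and field names below are
made up purely for illustration.

  import java.net.MalformedURLException;
  import java.net.URL;
  import java.util.List;
  import java.util.Map;
  import java.util.regex.Pattern;

  /** Illustration only: one regex list per host; URLs from unknown
      hosts, or matching none of their host's patterns, are dropped. */
  public class PerSiteRegexFilterSketch {

    // host -> compiled patterns that URLs from that host must match
    private final Map<String, List<Pattern>> patternsByHost;

    public PerSiteRegexFilterSketch(Map<String, List<Pattern>> patternsByHost) {
      this.patternsByHost = patternsByHost;
    }

    // Same contract as a Nutch URLFilter: return the URL to keep it,
    // or null to drop it.
    public String filter(String urlString) {
      URL url;
      try {
        url = new URL(urlString);
      } catch (MalformedURLException e) {
        return null;                        // unparseable: drop
      }
      List<Pattern> patterns = patternsByHost.get(url.getHost().toLowerCase());
      if (patterns == null) {
        return null;                        // host not in our set: drop
      }
      for (Pattern p : patterns) {
        if (p.matcher(urlString).find()) {
          return urlString;                 // matched a site pattern: keep
        }
      }
      return null;                          // no pattern matched: drop
    }
  }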
If instead you want to crawl entire sites, but ignoring CGIs is OK,
then PrefixURLFilter is the easiest answer. Create a newline-
delimited text file with the site URLs and use it as both the seed
URL list (nutch inject -urlfile) and the PrefixURLFilter config file
(set "urlfilter.prefix.file" in nutch-site.xml).
HTH,
--Matt
On Nov 29, 2005, at 4:57 PM, Kristan Uccello wrote:
Hello all,
I am attempting to modify the RegexUrlFilter and/or the NutchConfig
so that I can dynamically apply a set of domain names to the fetcher.
In the FAQ:
>> Is it possible to fetch only pages from some specific domains?
>> Please have a look on PrefixURLFilter. Adding some regular
>> expressions to the urlfilter.regex.file might work, but adding a
>> list with thousands of regular expressions would slow down your
>> system excessively.
I wish to be able to provide a list of URLs that I want to have
fetched, and I want the fetcher to fetch only from those sites (not
follow any links out of those sites). I would like to be able to
keep adding to this list without having to modify the nutch-
config.xml each time, but instead just add it to the config (or
other object) in memory. All I am after is a point in the right
direction. If someone could tell me whether I am looking in the
wrong files (or am off my rocker!), please let me know where I
could/should go.
The reason I am asking is that I am working on a "roll your own
search". I want to be able to crawl specific sites only, and then
have the search results come only from some subset of those crawled
sites.
Best regards,
Kristan Uccello
--
Matt Kangas / [EMAIL PROTECTED]