(note: this is probably more relevant to nutch-user. please send
replies there.)
This question seems to come up periodically.
Personally, I accomplish this via a custom URLFilter that uses a
MapFile of regex pattern-lists, e.g. one set of regexes per website.
You can find the code at http://issues.apache.org/jira/browse/NUTCH-87.
All this does is allow you to keep track of a large set of regexes,
partitioned by site. It's useful if you want an extremely-focused
crawl, possibly burrowing through CGIs.
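To make the idea concrete, here is a minimal, self-contained sketch
of the per-site regex lookup. This is NOT the NUTCH-87 code: the real
plugin reads its pattern lists from a MapFile and plugs into Nutch's
URLFilter extension point, and the class and field names below are
made up purely for illustration.

  import java.net.MalformedURLException;
  import java.net.URL;
  import java.util.List;
  import java.util.Map;
  import java.util.regex.Pattern;

  /** Illustration only: one regex list per host; URLs from unknown
      hosts, or matching none of their host's patterns, are dropped. */
  public class PerSiteRegexFilterSketch {

    // host -> compiled patterns that URLs from that host must match
    private final Map<String, List<Pattern>> patternsByHost;

    public PerSiteRegexFilterSketch(Map<String, List<Pattern>> patternsByHost) {
      this.patternsByHost = patternsByHost;
    }

    // Same contract as a Nutch URLFilter: return the URL to keep it,
    // or null to drop it.
    public String filter(String urlString) {
      URL url;
      try {
        url = new URL(urlString);
      } catch (MalformedURLException e) {
        return null;                        // unparseable: drop
      }
      List<Pattern> patterns = patternsByHost.get(url.getHost().toLowerCase());
      if (patterns == null) {
        return null;                        // host not in our set: drop
      }
      for (Pattern p : patterns) {
        if (p.matcher(urlString).find()) {
          return urlString;                 // matched a site pattern: keep
        }
      }
      return null;                          // no pattern matched: drop
    }
  }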
If instead you want to crawl entire sites, but ignoring CGIs is OK,
then PrefixURLFilter is the easiest answer. Create a newline-
delimited text file with the site URLs and use it as both the seed
URL list (nutch inject -urlfile) and the PrefixURLFilter config file
(set "urlfilter.prefix.file" in nutch-site.xml).
HTH,
--Matt
On Nov 29, 2005, at 4:57 PM, Kristan Uccello wrote:
Hello all,
I am attempting to modify the RegexUrlFilter and/or the NutchConfig
so that I can dynamically apply a set of domain names to the fetcher.
In the FAQ:
>> Is it possible to fetch only pages from some specific domains?
>> Please have a look on PrefixURLFilter. Adding some regular
>> expressions to the urlfilter.regex.file might work, but adding a
>> list with thousands of regular expressions would slow down your
>> system excessively.
I wish to be able to provide a list of URLs that I want to have
fetched, and I want the fetcher to fetch only from those sites (not
follow any links out of those sites). I would like to be able to
keep adding to this list without having to modify the nutch-
config.xml each time, but instead just add it to the config (or
other object) in memory. All I am after is a point in the right
direction. If someone could tell me whether I am looking in the
wrong files (or am off my rocker!), please let me know where I
could/should go.
The reason I am asking is that I am working on a "roll your own
search". I want to be able to crawl specific sites only, and then
have the search results come only from some subset of those crawled
sites.
Best regards,
Kristan Uccello
--
Matt Kangas / [EMAIL PROTECTED]