Hi Matt,

Thank you for the reply; I will play with what you suggested, as it sounds pretty close to what I am after. The only caveat is that all the websites I will be hosting run from one source, so I am after a way to dynamically inject a filter based on which site is calling the searcher. I plan to have Nutch running on one web server and access it through its RSS REST interface; my thought was to provide an additional REST parameter like ...&callerID=xxx and have Nutch load the appropriate class, which would filter the results down to a list of domains (stored in a file or database or something) keyed by the callerID value.
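Roughly, the kind of post-filter I have in mind would look something like the sketch below (the class and map names are made up for illustration — this is not part of Nutch, and in practice the map would be loaded from a file or database):

```java
import java.net.URI;
import java.util.*;

// Hypothetical sketch: map a callerID to the set of hosts whose
// results that caller is allowed to see, then drop any hit whose
// host falls outside that set.
public class CallerDomainFilter {
    private final Map<String, Set<String>> allowedByCaller;

    public CallerDomainFilter(Map<String, Set<String>> allowedByCaller) {
        this.allowedByCaller = allowedByCaller;
    }

    /** Keep only result URLs whose host is allowed for this caller. */
    public List<String> filter(String callerId, List<String> resultUrls) {
        Set<String> allowed =
            allowedByCaller.getOrDefault(callerId, Collections.emptySet());
        List<String> kept = new ArrayList<>();
        for (String url : resultUrls) {
            try {
                String host = URI.create(url).getHost();
                if (host != null && allowed.contains(host)) {
                    kept.add(url);
                }
            } catch (IllegalArgumentException e) {
                // skip malformed URLs rather than failing the whole request
            }
        }
        return kept;
    }
}
```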
I'm pretty new to Nutch (I have only been working with it for a few days), so any guidance you can give me will be greatly appreciated.
Cheers,
Kristan Uccello
Matt Kangas wrote:
(note: this is probably more relevant to nutch-user. please send
replies there.)
This question seems to come up periodically.
Personally, I accomplish this via a custom URLFilter that uses a
MapFile of regex pattern-lists, e.g. one set of regexes per website.
You can find the code in http://issues.apache.org/jira/browse/NUTCH-87
All this does is allow you to keep track of a large set of regexes,
partitioned by site. It's useful if you want an extremely-focused
crawl, possibly burrowing through CGIs.
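As a toy illustration of the idea only (not the actual NUTCH-87 code — just the partitioned-regex concept in plain Java, with the MapFile replaced by an in-memory map):

```java
import java.net.URI;
import java.util.*;
import java.util.regex.Pattern;

// Toy illustration: one regex list per host, so a URL is only
// tested against the patterns registered for its own site.
public class PerSiteRegexFilter {
    private final Map<String, List<Pattern>> patternsByHost;

    public PerSiteRegexFilter(Map<String, List<Pattern>> patternsByHost) {
        this.patternsByHost = patternsByHost;
    }

    /** Mirrors URLFilter.filter(): return the URL if accepted, else null. */
    public String filter(String url) {
        String host = URI.create(url).getHost();
        List<Pattern> patterns = patternsByHost.get(host);
        if (patterns == null) return null;           // unknown site: reject
        for (Pattern p : patterns) {
            if (p.matcher(url).find()) return url;   // first match accepts
        }
        return null;
    }
}
```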
If instead you want to crawl entire sites, and ignoring CGIs is OK, then PrefixURLFilter is the easiest answer. Create a newline-delimited text file with the site URLs and use it both as the seed URLs (nutch inject -urlfile) and as the PrefixURLFilter config file (set "urlfilter.prefix.file" in nutch-site.xml).
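For example, the nutch-site.xml entry would look roughly like this (the file path is just a placeholder — point it at your own site list):

```xml
<!-- nutch-site.xml: point PrefixURLFilter at your site list -->
<property>
  <name>urlfilter.prefix.file</name>
  <value>urls/sites.txt</value>
</property>
```

The same urls/sites.txt file, one URL prefix per line, is what you would pass to "nutch inject -urlfile".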
HTH,
--Matt
On Nov 29, 2005, at 4:57 PM, Kristan Uccello wrote:
Hello all,
I am attempting to modify the RegexUrlFilter and/or the NutchConfig so that I can dynamically apply a set of domain names to the fetcher.
In the FAQ:
>> Is it possible to fetch only pages from some specific domains?
>> Please have a look on PrefixURLFilter. Adding some regular expressions to the urlfilter.regex.file might work, but adding a list with thousands of regular expressions would slow down your system excessively.
I wish to be able to provide a list of URLs that I want to have fetched, and I want the fetcher to fetch only from those sites (not follow any links out of them). I would like to be able to keep adding to this list without having to modify nutch-config.xml each time, instead just adding it to the config (or another object) in memory. All I am after is a pointer in the right direction. If I am looking in the wrong files (or off my rocker!), please let me know where I could/should go.
The reason I am asking is that I am working on a "roll your own search". I want to be able to crawl specific sites only and then, at search time, return results pertaining only to some subset of those crawled sites.
Best regards,
Kristan Uccello
--
Matt Kangas / [EMAIL PROTECTED]