Hi Matt,

Thank you for the reply. I will play with what you suggested, as it sounds pretty close to what I am after. The only caveat is that all the websites I will be hosting run from one source, so I need a way to dynamically inject a filter based on which site is calling the searcher. I plan to have Nutch running on one web server and access it through its RSS/REST interface. My thought was to provide an additional REST parameter like ...&callerID=xxx and have Nutch load the appropriate class to filter the results down to a list of domains (stored in a file, database, or something similar), using the callerID value as the key.
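To make that concrete, here is a rough sketch of what I am picturing on the Java side. It is not tied to any actual Nutch API; the class name, property-file format, and callerID keys are just placeholders for illustration:

import java.io.FileInputStream;
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class CallerDomainFilter {

    // callerID -> set of allowed host names
    private final Map<String, Set<String>> domainsByCaller =
            new HashMap<String, Set<String>>();

    // Expects a properties file with lines like:
    //   site42=www.example.org,news.example.com
    public CallerDomainFilter(String propsFile) throws IOException {
        Properties props = new Properties();
        FileInputStream in = new FileInputStream(propsFile);
        try {
            props.load(in);
        } finally {
            in.close();
        }
        for (String caller : props.stringPropertyNames()) {
            Set<String> domains = new HashSet<String>();
            for (String d : props.getProperty(caller).split(",")) {
                domains.add(d.trim().toLowerCase());
            }
            domainsByCaller.put(caller, domains);
        }
    }

    // Keep only result URLs whose host is on the caller's list;
    // an unknown callerID returns an empty list.
    public List<String> filter(String callerId, List<String> resultUrls) {
        Set<String> allowed = domainsByCaller.get(callerId);
        if (allowed == null) {
            return Collections.emptyList();
        }
        List<String> kept = new ArrayList<String>();
        for (String u : resultUrls) {
            try {
                String host = new URL(u).getHost().toLowerCase();
                if (allowed.contains(host)) {
                    kept.add(u);
                }
            } catch (Exception e) {
                // skip malformed URLs
            }
        }
        return kept;
    }
}

So a request with ...&callerID=site42 would only ever return hits from the domains listed under the site42 key.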

I'm pretty new to Nutch (I have only been working with it for a few days), so any guidance you can give me will be greatly appreciated.

Cheers,

Kristan Uccello

Matt Kangas wrote:

(note: this is probably more relevant to nutch-user. please send replies there.)

This question seems to come up periodically.

Personally, I accomplish this via a custom URLFilter that uses a MapFile of regex pattern-lists, e.g. one set of regexes per website. You can find the code in http://issues.apache.org/jira/browse/NUTCH-87

All this does is allow you to keep track of a large set of regexes, partitioned by site. It's useful if you want an extremely-focused crawl, possibly burrowing through CGIs.
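Stripped of the plugin machinery and the MapFile storage, the idea boils down to something like the following (a simplified illustration only, not the actual NUTCH-87 code; the site keys and patterns are whatever you choose):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class PerSiteRegexFilter {

    // site key -> compiled regexes for that site
    private final Map<String, List<Pattern>> patternsBySite =
            new HashMap<String, List<Pattern>>();

    public void addPattern(String siteKey, String regex) {
        List<Pattern> patterns = patternsBySite.get(siteKey);
        if (patterns == null) {
            patterns = new ArrayList<Pattern>();
            patternsBySite.put(siteKey, patterns);
        }
        patterns.add(Pattern.compile(regex));
    }

    // Loosely follows the URLFilter contract: return the URL to keep it,
    // or null to drop it.
    public String filter(String siteKey, String url) {
        List<Pattern> patterns = patternsBySite.get(siteKey);
        if (patterns == null) {
            return null;
        }
        for (Pattern p : patterns) {
            if (p.matcher(url).find()) {
                return url;
            }
        }
        return null;
    }
}

The patch in NUTCH-87 does essentially this, but keeps the per-site lists in a MapFile and hooks into the URLFilter extension point.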

If instead you want to crawl entire sites, but ignoring CGIs is OK, then PrefixURLFilter is the easiest answer. Create a newline-delimited text file with the site URLs and use this as both seed urls (nutch inject -urlfile) and as the prefixurlfilter config file (set "urlfilter.prefix.file" in nutch-site.xml).
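For example, with two sites it would look roughly like this (the file name urls.txt is just an example; check nutch-default.xml for the property's default value):

urls.txt (used as both the seed list and the prefix filter list):

http://www.example.org/
http://news.example.com/

nutch-site.xml:

<property>
  <name>urlfilter.prefix.file</name>
  <value>urls.txt</value>
</property>

The same urls.txt is then what you pass to nutch inject -urlfile.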

HTH,
--Matt

On Nov 29, 2005, at 4:57 PM, Kristan Uccello wrote:

Hello all,

I am attempting to modify the RegexUrlFilter and/or the NutchConfig so that I can dynamically apply a set of domain names to the fetcher.

In the FAQ:


>>Is it possible to fetch only pages from some specific domains?

>>Please have a look on PrefixURLFilter. Adding some regular expressions to the urlfilter.regex.file might work, but adding a list with thousands of regular expressions would slow down your system excessively.


I wish to be able to provide a list of urls that I want to have fetched, and I want the fetcher to only fetch from those sites (not follow any links out of those sites). I would like to be able to keep adding to this list without having to modify the nutch-config.xml each time, but instead just add it to the config (or other object) in memory. All I am after is a pointer in the right direction. If someone could tell me whether I am looking in the wrong files (or am off my rocker!), please let me know where I could/should go.

The reason I am asking is that I am working on a "roll your own search". I want to be able to crawl specific sites only, and then have the search results pertain only to some subset of those crawled sites.

Best regards,

Kristan Uccello


--
Matt Kangas / [EMAIL PROTECTED]



