Hi Matt,

Thank you for the reply; I will play with what you suggested, as it sounds pretty close to what I am after. The only caveat is that all the websites I will be hosting run from one source, so I am after a way to dynamically inject a filter based on which site is calling the searcher. I plan to have Nutch running on one web server and access it through its RSS REST interface; my thought was to provide an additional REST parameter like ...&callerID=xxx and have Nutch load the appropriate class, which would filter the results down to a list of domains (stored in a file or database or something) keyed by the callerID value.
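Roughly, the kind of post-filter I have in mind would look something like the sketch below (the class and map names are made up for illustration — this is not part of Nutch, and in practice the map would be loaded from a file or database):

```java
import java.net.URI;
import java.util.*;

// Hypothetical sketch: map a callerID to the set of hosts whose
// results that caller is allowed to see, then drop any hit whose
// host falls outside that set.
public class CallerDomainFilter {
    private final Map<String, Set<String>> allowedByCaller;

    public CallerDomainFilter(Map<String, Set<String>> allowedByCaller) {
        this.allowedByCaller = allowedByCaller;
    }

    /** Keep only result URLs whose host is allowed for this caller. */
    public List<String> filter(String callerId, List<String> resultUrls) {
        Set<String> allowed =
            allowedByCaller.getOrDefault(callerId, Collections.emptySet());
        List<String> kept = new ArrayList<>();
        for (String url : resultUrls) {
            try {
                String host = URI.create(url).getHost();
                if (host != null && allowed.contains(host)) {
                    kept.add(url);
                }
            } catch (IllegalArgumentException e) {
                // skip malformed URLs rather than failing the whole request
            }
        }
        return kept;
    }
}
```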
I'm pretty new to Nutch (I have only been working with it for a few days), so any guidance you can give me will be greatly appreciated.
Cheers,
Kristan Uccello
Matt Kangas wrote:
(note: this is probably more relevant to nutch-user. please send
replies there.)
This question seems to come up periodically.
Personally, I accomplish this via a custom URLFilter that uses a
MapFile of regex pattern-lists, e.g. one set of regexes per website.
You can find the code in http://issues.apache.org/jira/browse/NUTCH-87
All this does is allow you to keep track of a large set of regexes,
partitioned by site. It's useful if you want an extremely-focused
crawl, possibly burrowing through CGIs.
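As a toy illustration of the idea only (not the actual NUTCH-87 code — just the partitioned-regex concept in plain Java, with the MapFile replaced by an in-memory map):

```java
import java.net.URI;
import java.util.*;
import java.util.regex.Pattern;

// Toy illustration: one regex list per host, so a URL is only
// tested against the patterns registered for its own site.
public class PerSiteRegexFilter {
    private final Map<String, List<Pattern>> patternsByHost;

    public PerSiteRegexFilter(Map<String, List<Pattern>> patternsByHost) {
        this.patternsByHost = patternsByHost;
    }

    /** Mirrors URLFilter.filter(): return the URL if accepted, else null. */
    public String filter(String url) {
        String host = URI.create(url).getHost();
        List<Pattern> patterns = patternsByHost.get(host);
        if (patterns == null) return null;           // unknown site: reject
        for (Pattern p : patterns) {
            if (p.matcher(url).find()) return url;   // first match accepts
        }
        return null;
    }
}
```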
If instead you want to crawl entire sites, and ignoring CGIs is OK, then PrefixURLFilter is the easiest answer. Create a newline-delimited text file with the site URLs and use it both as the seed URLs (nutch inject -urlfile) and as the PrefixURLFilter config file (set "urlfilter.prefix.file" in nutch-site.xml).
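For example, the nutch-site.xml entry would look roughly like this (the file path is just a placeholder — point it at your own site list):

```xml
<!-- nutch-site.xml: point PrefixURLFilter at your site list -->
<property>
  <name>urlfilter.prefix.file</name>
  <value>urls/sites.txt</value>
</property>
```

The same urls/sites.txt file, one URL prefix per line, is what you would pass to "nutch inject -urlfile".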
HTH,
--Matt
On Nov 29, 2005, at 4:57 PM, Kristan Uccello wrote:
Hello all,
I am attempting to modify the RegexUrlFilter and/or the NutchConfig so that I can dynamically apply a set of domain names to the fetcher.
In the FAQ:
>> Is it possible to fetch only pages from some specific domains?
>> Please have a look on PrefixURLFilter. Adding some regular expressions to the urlfilter.regex.file might work, but adding a list with thousands of regular expressions would slow down your system excessively.
I wish to be able to provide a list of URLs that I want to have fetched, and I want the fetcher to fetch only from those sites (not follow any links out of them). I would like to be able to keep adding to this list without having to modify nutch-config.xml each time, instead just adding it to the config (or another object) in memory. All I am after is a pointer in the right direction. If I am looking in the wrong files (or off my rocker!), please let me know where I could/should go.
The reason I am asking is that I am working on a "roll your own search". I want to be able to crawl specific sites only and then, at search time, return results pertaining only to some subset of those crawled sites.
Best regards,
Kristan Uccello
--
Matt Kangas / [EMAIL PROTECTED]