Nutch can do this as well; as mentioned, you need an index filter.
Take a look at the Creative Commons index filter; you can easily
customize it to your needs.
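
To make the index-filter idea concrete, here is a minimal sketch of the collection-assignment logic only. It deliberately does not use any Nutch classes: the class name CollectionTagger, the "default" fallback, and the host/path prefixes are all assumptions made up for illustration. A customized index filter could call something like this for each page and store the result in an extra index field, so that a query can later be restricted to one collection.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Nutch): decides which collection a
// crawled URL belongs to, based on a host + path prefix.
final class CollectionTagger {
    // Prefix -> collection name. These hosts and paths are made-up examples.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("www.domain.com/news/", "news");
        RULES.put("www.domain.com/analysis/", "analysis");
    }

    // Returns the collection for a URL, or "default" when no rule matches.
    static String collectionFor(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost() == null ? "" : uri.getHost();
        String path = uri.getPath() == null ? "" : uri.getPath();
        String hostAndPath = host + path;
        // Rules are checked in insertion order; the first matching prefix wins.
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            if (hostAndPath.startsWith(rule.getKey())) {
                return rule.getValue();
            }
        }
        return "default";
    }

    public static void main(String[] args) {
        System.out.println(collectionFor("http://www.domain.com/news/2005/10/story.html"));
        System.out.println(collectionFor("http://www.domain.com/analysis/q3.html"));
        System.out.println(collectionFor("http://www.domain.com/about.html"));
    }
}
```

With this approach every URL is still crawled and indexed; the collection name is just one more field on the document, and the restriction happens only at search time.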
HTH
Stefan
On 25.10.2005 at 20:57, XIN LING wrote:
Thanks for the information. But the limits should only be on the search
side, not the crawl side. I would like to have all the URLs crawled and
indexed. It is only at search time that I might restrict a query to a
certain collection.
Google can do this: it crawls all URLs but provides collections to serve
different search purposes.
Thanks.
-----Original Message-----
From: Bryan Woliner [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 25, 2005 1:53 PM
To: [email protected]
Subject: Re: Collections.
The regular expressions that you use in your regex-urlfilter.txt file
allow you to specify that Nutch should only crawl certain parts of a
domain.
For example, you could limit your crawl to URLs that start with
news.domain or www.domain.com/news
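
As an illustration of how such include/exclude rules behave, here is a small self-contained sketch. It assumes the usual rule format of one regular expression per line, prefixed with '+' to accept or '-' to reject, where the first matching rule decides and a URL that matches nothing is dropped. The class name UrlRuleFilter and the example hosts are hypothetical; this is not Nutch's own implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of first-match-wins URL filtering with '+' (accept) and
// '-' (reject) rules. Hypothetical class, not taken from Nutch.
final class UrlRuleFilter {
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Each line is '+regex' or '-regex'; blank lines and '#' comments are skipped.
    UrlRuleFilter(String... lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    // The first rule whose pattern is found in the URL decides the outcome;
    // a URL that matches no rule is rejected.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlRuleFilter filter = new UrlRuleFilter(
            "# crawl only the news parts of the (made-up) domain",
            "+^http://news\\.domain\\.com/",
            "+^http://www\\.domain\\.com/news/",
            "-.");
        System.out.println(filter.accepts("http://news.domain.com/today.html"));
        System.out.println(filter.accepts("http://www.domain.com/news/story.html"));
        System.out.println(filter.accepts("http://www.domain.com/analysis/q3.html"));
    }
}
```

Note that rules like these limit what gets crawled in the first place, so anything rejected here never reaches the index.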
If you search the mailing list archive or the Nutch wiki, you should be
able to find more info on what type of regular expressions the
regex-urlfilter.txt file uses.
-Bryan
On 10/25/05, XIN LING <[EMAIL PROTECTED]> wrote:
No, what I mean is a set of URLs in a collection. For example, a
finance web site might divide its web pages into 2 collections, news
and analysis. This way, if I am only interested in news, I can restrict
my search to this collection without bothering with the analysis part.
I know other search engines can do this: Google, htdig, etc.
Thanks.
-----Original Message-----
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 25, 2005 1:38 PM
To: [email protected]
Subject: Re: Collections.
What do you mean by collections? java.util.Collections?
On 25.10.2005 at 20:27, XIN LING wrote:
Hi, does anyone know if Nutch supports collections? How do I set up
collections in Nutch?
Thanks.
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net