Nutch can do this as well; as mentioned, you need an index filter.
Take a look at the Creative Commons index filter; you can easily customize it to your needs.
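
As a rough illustration of the kind of logic such a customized index filter could carry, here is a minimal sketch that maps a URL to a collection name. It deliberately does not use the Nutch plugin API: the class name CollectionTagger, the method collectionFor, the "default" fallback and the domain.com patterns are all hypothetical placeholders. The assumption is that the filter would store the returned name in an index field at index time, so that a search can later be restricted to one collection.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: decides which collection a URL belongs to.
class CollectionTagger {
    // Ordered rules: the first pattern found in the URL decides the collection.
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        // Placeholder patterns for the finance-site example (news vs. analysis).
        RULES.put(Pattern.compile(
                "^https?://(news\\.domain\\.com|www\\.domain\\.com/news)(/|$)"), "news");
        RULES.put(Pattern.compile(
                "^https?://www\\.domain\\.com/analysis(/|$)"), "analysis");
    }

    // Returns the collection name for the URL, or "default" if no rule matches.
    static String collectionFor(String url) {
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(url).find()) {
                return rule.getValue();
            }
        }
        return "default";
    }
}
```

With this approach every URL is still crawled and indexed; only the extra field differs per page, and the restriction happens at query time.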

HTH
Stefan

On 25.10.2005 at 20:57, XIN LING wrote:

Thanks for the information. But the restriction should apply only on the
search side, not the crawl side. I would like to have all the URLs crawled
and indexed. It is only at search time that I might restrict results to a
certain collection.

Google can do this: it crawls all URLs but provides collections to serve
different search purposes.

Thanks.

-----Original Message-----
From: Bryan Woliner [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 25, 2005 1:53 PM
To: [email protected]
Subject: Re: Collections.

The regular expressions that you use in your regex-urlfilter.txt file
allow you to specify that Nutch should only crawl certain parts of a
domain.

For example, you could limit your search to URLs that start with
news.domain.com or www.domain.com/news
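
To make the rule format concrete, here is a small sketch that emulates how such a filter file is commonly described as working: each rule is a regular expression prefixed with '+' (accept) or '-' (reject), the first rule that matches decides, and a URL that matches no rule is ignored. This is a stand-alone illustration, not Nutch's own code; the class name UrlFilterSketch, the method accept and the domain.com rules are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone emulation of '+'/'-' prefixed regex rules.
class UrlFilterSketch {
    // Returns true if the first matching rule starts with '+'.
    // Blank lines and lines starting with '#' are skipped as comments.
    static boolean accept(List<String> rules, String url) {
        for (String line : rules) {
            String rule = line.trim();
            if (rule.isEmpty() || rule.startsWith("#")) {
                continue;
            }
            char sign = rule.charAt(0);
            String regex = rule.substring(1);
            // find() means the regex may match anywhere unless anchored with ^.
            if (Pattern.compile(regex).matcher(url).find()) {
                return sign == '+';
            }
        }
        // Assumption: a URL that matches no rule is ignored.
        return false;
    }
}
```

Rules for the two prefixes above might then look like "+^http://news\.domain\.com/" and "+^http://www\.domain\.com/news", followed by "-." to reject everything else.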

If you search the mailing list archive or the Nutch wiki, you should be
able to find more info on what type of regular expressions the
regex-urlfilter.txt file uses.

-Bryan

On 10/25/05, XIN LING <[EMAIL PROTECTED]> wrote:


No, what I mean is a set of URLs in a collection. For example, a
finance web site might divide its web pages into two collections, news
and analysis. This way, if I am only interested in news, I can restrict
my search to that collection without bothering with the analysis part.

I know other search engines can do this: Google, htdig, etc.

Thanks.

-----Original Message-----
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 25, 2005 1:38 PM
To: [email protected]
Subject: Re: Collections.

What do you mean by collections? java.util.Collections?

On 25.10.2005 at 20:27, XIN LING wrote:


Hi, does anyone know if Nutch supports collections? How do I set up
collections in Nutch?

Thanks.




---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net






