Thanks for the information. But the limits should only be on the search
side, not the crawl side. I would like to have all the URLs crawled and
indexed; it is only at search time that I might restrict results to a
certain collection.

Google can do this: crawl all URLs, but provide collections to serve
different search purposes.

Thanks.
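[A hedged sketch of what search-time scoping could look like, assuming the
query-site plugin is enabled in the Nutch install; the host name is a
placeholder, not something from this thread. Everything is crawled and
indexed, and only the individual query is constrained:]

```
# Typed into the Nutch search box; only this query is scoped,
# the index itself contains all crawled URLs.
# "news.domain.com" is an illustrative host.
quarterly results site:news.domain.com
```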

-----Original Message-----
From: Bryan Woliner [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, October 25, 2005 1:53 PM
To: [email protected]
Subject: Re: Collections.

The regular expressions that you use in your regex-urlfilter.txt file
allow you to specify that Nutch should only crawl certain parts of a
domain.

For example, you could limit your crawl to URLs that start with
news.domain.com or www.domain.com/news

If you search the mailing list archive or the Nutch wiki you should be
able to find more info on what type of regular expressions the
regex-urlfilter.txt file uses.
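[For illustration, a minimal regex-urlfilter.txt along those lines might
look like the following; the domain is a placeholder. Rules are applied
top-down: + includes a matching URL, - excludes it:]

```
# Crawl only the news area of the site (hypothetical host/path)
+^http://news\.domain\.com/
+^http://www\.domain\.com/news/
# Exclude everything else
-.
```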

-Bryan

On 10/25/05, XIN LING <[EMAIL PROTECTED]> wrote:
>
> No, what I mean is a set of URLs in a collection. For example, a
> finance web site might divide its web pages into two collections, news
> and analysis. This way, if I am only interested in news, I can refine
> my search to that collection without touching the analysis part.
>
> I know other search engines can do this, google, htdig, etc.
>
> Thanks.
>
> -----Original Message-----
> From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, October 25, 2005 1:38 PM
> To: [email protected]
> Subject: Re: Collections.
>
> What do you mean by collections? java.util.Collections?
>
> Am 25.10.2005 um 20:27 schrieb XIN LING:
>
> > Hi, does anyone know if Nutch supports collections? How to set 
> > collections in nutch?
> >
> > Thanks.
> >
> >
>
> ---------------------------------------------------------------
> company: http://www.media-style.com
> forum: http://www.text-mining.org
> blog: http://www.find23.net
>
>
>
