The regular expressions that you use in your regex-urlfilter.txt file allow
you to specify that Nutch should only crawl certain parts of a domain.

For example, you could limit your search to URLs that start with
news.domain.com or www.domain.com/news.
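
To make that concrete, here is a rough sketch of how such rules behave. It is
a small Python illustration, not Nutch code: parse_rules and accepts are names
I made up, and the rules assume the usual regex-urlfilter.txt conventions (each
line is a regex prefixed with + or -, the first matching line decides, and a
URL that matches nothing is ignored). The domain names are placeholders.

```python
import re

# Hypothetical rules written in the "+regex" / "-regex" style of
# regex-urlfilter.txt. The hosts are placeholders, not real sites.
RULES = r"""
# accept the news subdomain and the /news path
+^http://news\.domain\.com/
+^http://www\.domain\.com/news
# reject everything else
-.
"""

def parse_rules(text):
    """Turn rule lines into (accept, compiled_regex) pairs.

    Blank lines and lines starting with '#' are skipped.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule whose regex is found in the URL.

    A URL that matches no rule is rejected (assumed default behaviour).
    """
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False

rules = parse_rules(RULES)
print(accepts(rules, "http://www.domain.com/news/today.html"))   # True
print(accepts(rules, "http://www.domain.com/analysis/q3.html"))  # False
```

Only the lines inside RULES would go into the filter file itself; the rest is
just there to show which URLs pass.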

If you search the mailing list archive or the Nutch wiki, you should be able
to find more information on the kind of regular expressions that the
regex-urlfilter.txt file uses.

-Bryan

On 10/25/05, XIN LING <[EMAIL PROTECTED]> wrote:
>
> No, what I mean is a set of URLs in a collection. For example, a finance
> web site might divide the web pages into 2 collections, news and
> analysis. This way if I am only interested in news, I can refine my
> search to this collection, without bothering analysis part.
>
> I know other search engines can do this, google, htdig, etc.
>
> Thanks.
>
> -----Original Message-----
> From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, October 25, 2005 1:38 PM
> To: [email protected]
> Subject: Re: Collections.
>
> What do you mean with collections? java.lang.collections?
>
> Am 25.10.2005 um 20:27 schrieb XIN LING:
>
> > Hi, does anyone know if Nutch supports collections? How to set
> > collections in nutch?
> >
> > Thanks.
> >
> >
>
> ---------------------------------------------------------------
> company: http://www.media-style.com
> forum: http://www.text-mining.org
> blog: http://www.find23.net
>
>
>
