The regular expressions that you use in your regex-urlfilter.txt file allow you to specify that Nutch should only crawl certain parts of a domain.
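
As a rough sketch of what such a file could look like (untested, so treat it as an illustration only): it assumes the usual convention that each rule is a `+` (accept) or `-` (reject) followed by a regular expression, and that the first rule matching a URL decides. The host name domain.com is only a placeholder.

```text
# Hypothetical regex-urlfilter.txt fragment; domain.com is a placeholder.
# Each rule is '+' (accept) or '-' (reject) followed by a regex.
# Rules are tried top to bottom; the first one that matches decides.

# accept everything on the news subdomain
+^http://news\.domain\.com/

# accept the /news section of the main site
+^http://www\.domain\.com/news(/|$)

# reject everything else
-.
```

The final `-.` rule matters: without a catch-all reject at the end, a URL that matches none of the accept rules may be handled differently than you expect, so it is safer to spell it out.
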
For example, you could limit your search to URLs that start with news.domain.com or www.domain.com/news.

If you search the mailing list archive or the Nutch wiki, you should be able to find more information on the type of regular expressions that the regex-urlfilter.txt file uses.

-Bryan

On 10/25/05, XIN LING <[EMAIL PROTECTED]> wrote:
>
> No, what I mean is a set of URLs in a collection. For example, a finance
> web site might divide its web pages into 2 collections, news and
> analysis. This way, if I am only interested in news, I can refine my
> search to this collection without bothering with the analysis part.
>
> I know other search engines can do this: google, htdig, etc.
>
> Thanks.
>
> -----Original Message-----
> From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, October 25, 2005 1:38 PM
> To: [email protected]
> Subject: Re: Collections.
>
> What do you mean by collections? java.lang.collections?
>
> On 25.10.2005 at 20:27, XIN LING wrote:
>
> > Hi, does anyone know if Nutch supports collections? How do I set up
> > collections in Nutch?
> >
> > Thanks.
> >
>
> ---------------------------------------------------------------
> company: http://www.media-style.com
> forum: http://www.text-mining.org
> blog: http://www.find23.net
