I will indeed! I hadn't thought to check through the properties, but I am familiarizing myself with them now. There is certainly a treasure trove of goodness in there.
Thank you for your assistance.

Cheers,
Vince

On 8/2/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
>
> hi Vince,
>
> have you tried this property?
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>false</value>
>   <description>If true, outlinks leading from a page to external hosts
>   will be ignored. This is an effective way to limit the crawl to include
>   only initially injected hosts, without creating complex URLFilters.
>   </description>
> </property>
>
> HTH,
> Renaud
>
>
> Vince Filby wrote:
> > Hello,
> >
> > Is there a way to tell Nutch to only follow links within the domain it
> > is currently crawling?
> >
> > What I would like to do is pass in a list of URLs, and Nutch should
> > ignore all outbound links to any domain other than the one the link
> > comes from. Let's say that I am crawling www.test1.com; I should only
> > follow links to www.test1.com.
> >
> > I realize that I can do this with the regex filter *if* I add a regex
> > rule for *each* site that I want to crawl, but this solution doesn't
> > scale well for my project. I have also read of a db-based URL filter
> > that maintains a list of accepted URLs in a database. This also doesn't
> > fit well, since I don't want to maintain both the crawl list and the
> > accepted-domain database. I can, but it is rather clunky.
> >
> > I have poked around the source, and it looks like the URL filtering
> > mechanism is only passed the link URL and returns a URL. So it appears
> > that this is not really possible at the code level without source
> > modifications. I would just like to confirm that I am not missing
> > anything obvious before I start reworking the code.
> >
> > Cheers,
> > Vince
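
Note: the block Renaud quotes is the default from nutch-default.xml. Per its
own description, the crawl-limiting behavior requires flipping the value to
true, which is conventionally done by overriding the property in
conf/nutch-site.xml rather than editing the defaults. A minimal override
would look something like:

    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
    </property>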

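For context on the interface Vince refers to: Nutch's URL filter extension
point, org.apache.nutch.net.URLFilter, exposes a single method,
String filter(String urlString), which returns the URL string to accept the
link or null to reject it. The sketch below (plain Java rather than a full
plugin; the allow-list helper is hypothetical) illustrates his point that
the filter never learns which page a link came from, so without source
changes the best it can do is consult a static per-site rule set:

    import java.net.MalformedURLException;
    import java.net.URL;

    public class SameHostFilterSketch {

        // Mirrors URLFilter.filter(String): return the URL string to keep
        // the link, or null to drop it.
        public String filter(String urlString) {
            try {
                String host = new URL(urlString).getHost();
                // The originating page's host is not available here, so the
                // filter can only consult a static allow-list, which is
                // exactly the per-site-rule approach that doesn't scale.
                return isAllowedHost(host) ? urlString : null;
            } catch (MalformedURLException e) {
                return null; // reject unparseable URLs
            }
        }

        // Hypothetical stand-in for a per-site allow-list or regex rules.
        private boolean isAllowedHost(String host) {
            return host != null && host.endsWith("test1.com");
        }
    }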