Hi Stevan, I am using the db.ignore.external.links property to limit the crawl to a domain, and I am getting a whole bunch of URLs from other domains as well. I suppose they are URLs reached by redirects from seed-domain URLs.
When I tried crawling with filter settings in the regex-urlfilter.txt and crawl-urlfilter.txt files I didn't see these extra URLs, and I also found that more URLs from the seed domain were crawled. For my automated crawling I need to use the db.ignore.external.links property, but I am concerned that it also results in covering fewer URLs from the seed domain. Is there a way to fix this? I don't set topN in my implementation.

Thanks and Regards,
Neera

On Fri, Mar 13, 2009 at 6:19 AM, Stevan Kovacevic <skovacevi...@gmail.com> wrote:
> Hi,
> you can avoid going to other domains by editing the urlfilter file,
> but this is not too practical when you have a lot of seed URLs, which
> you do. In the nutch-default.xml file there is a property,
> db.ignore.external.links, which is set to false by default. Set this to
> true and you will only crawl the seed URL domains. This file is located in
> the conf folder, in case you don't know. Note that if, while crawling,
> you bump into a link that redirects you to another domain, Nutch will
> consider the domain you are redirected to as valid.
>
> On Fri, Mar 13, 2009 at 10:59 AM, MyD <myd.ro...@googlemail.com> wrote:
> >
> > Hi @ all,
> >
> > is it possible to limit Nutch's crawling process to the seed URLs? E.g. I
> > have 1000 seed URLs and I want to crawl just these domains. Thanks in
> > advance.
> >
> > Regards,
> > MyD
> > --
> > View this message in context:
> > http://www.nabble.com/Limit-Nutch-Crawl-to-Seed-URLs-tp22493314p22493314.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
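For reference, the two approaches discussed in this thread look roughly like the sketches below. The property name comes from the thread itself; `example.com` is a placeholder for an actual seed domain, and overriding in nutch-site.xml (rather than editing nutch-default.xml directly) follows the usual Nutch convention.

```xml
<!-- conf/nutch-site.xml: override the default from conf/nutch-default.xml -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading to external domains are ignored.</description>
</property>
```

The filter-file alternative is a rule in conf/regex-urlfilter.txt (or crawl-urlfilter.txt for the one-shot crawl command), with one accept rule per seed domain:

```
# conf/regex-urlfilter.txt -- example.com is a placeholder seed domain
+^https?://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.
```

The filter approach also catches redirect targets (filters are applied to every fetched URL), which may explain why the extra domains disappear with it; the property approach only compares outlink domains against the page's own domain.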