Hi Stevan,

I am using the db.ignore.external.links property to limit the crawl to a
domain, but I am getting a whole bunch of URLs from other domains as well.
I suppose they are URLs redirected from seed-domain URLs.

When I tried crawling with filter settings in the regex-urlfilter.txt and
crawl-urlfilter.txt files I didn't see these extra URLs, and I also
found that more URLs from the seed domain were crawled.

For my automated crawling I need to use the db.ignore.external.links property,
but I am concerned that it also results in covering fewer URLs from the seed
domain. Is there a way to fix this? I don't set TopN in my implementation.
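
For reference, the filter rules I compared against looked roughly like the
following (example.com is a placeholder standing in for the actual seed
domain):

```
# regex-urlfilter.txt -- accept only pages on the seed domain
# (example.com is a placeholder for the real domain)
+^https?://([a-z0-9-]+\.)*example\.com/
# skip everything else
-.
```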


Thanks and Regards,
Neera

On Fri, Mar 13, 2009 at 6:19 AM, Stevan Kovacevic <skovacevi...@gmail.com> wrote:

> Hi,
> you can avoid going to other domains by editing the urlfilter file,
> but this is not very practical when you have a lot of seed URLs, which
> you do. In nutch-default.xml there is a property,
> db.ignore.external.links, which is set to false by default. Set it to
> true and you will only crawl the seed URL domains. The file is located
> in the conf folder, in case you don't know. Note that if, while crawling,
> you hit a link that redirects you to another domain, Nutch will
> consider the domain you are redirected to as valid.
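>
> For reference, a minimal override would go in conf/nutch-site.xml (which
> takes precedence over nutch-default.xml), roughly like this:
>
> ```xml
> <configuration>
>   <property>
>     <name>db.ignore.external.links</name>
>     <value>true</value>
>   </property>
> </configuration>
> ```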
>
> On Fri, Mar 13, 2009 at 10:59 AM, MyD <myd.ro...@googlemail.com> wrote:
> >
> > Hi @ all,
> >
> > is it possible to limit Nutch's crawling process to the seed URLs? E.g. I
> > have 1000 seed URLs and I want to crawl just these domains. Thanks in
> > advance.
> >
> > Regards,
> > MyD
> > --
> > View this message in context:
> http://www.nabble.com/Limit-Nutch-Crawl-to-Seed-URLs-tp22493314p22493314.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> >
>
