Re: Limit Nutch Crawl to Seed URLs

2009-03-20 Thread Neera Sharma

Hi Stevan,

I am using the db.ignore.external.links property to limit the crawl to a
domain, and I am getting a whole bunch of URLs from other domains as well.
I suppose they are URLs redirected from seed domain URLs.

When I tried crawling with filter settings in the regex-urlfilter.txt and
crawl-urlfilter.txt files, I didn't see these extra URLs, and I also
found that more URLs from the seed domain were crawled.

For my automated crawling I need to use the db.ignore.external.links property,
but I am concerned that it also results in covering fewer URLs from the seed
domain. Is there a way to fix this? I don't set TopN in my
implementation.


Thanks and Regards,
Neera
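
One possible explanation for the smaller coverage, stated as an assumption rather than confirmed behaviour: if db.ignore.external.links compares exact host names, a link from www.example.com to blog.example.com counts as external and is dropped, while a urlfilter rule written for example.com and its subdomains accepts both. The sketch below only illustrates that difference; the function names and the example.com hosts are hypothetical.

```python
from urllib.parse import urlparse


def same_host(from_url, to_url):
    """Exact host comparison (the ASSUMED behaviour of db.ignore.external.links)."""
    return urlparse(from_url).hostname == urlparse(to_url).hostname


def same_domain_suffix(from_url, to_url, domain):
    """Looser check: both hosts are `domain` itself or a subdomain of it."""
    def in_domain(url):
        host = urlparse(url).hostname or ""
        return host == domain or host.endswith("." + domain)
    return in_domain(from_url) and in_domain(to_url)
```

Under that assumption, pages on sibling hosts of a seed would be reachable with the urlfilter approach but not with the property alone.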

On Fri, Mar 13, 2009 at 6:19 AM, Stevan Kovacevic skovacevi...@gmail.com wrote:

 Hi,
 You can avoid going to other domains by editing the urlfilter file,
 but this is not too practical when you have a lot of seed URLs, which
 you do. The nutch-default.xml file has a property,
 db.ignore.external.links, which is set to false by default. Set this to
 true and you will only crawl the seed URL domains. This file is located in
 the conf folder, in case you don't know. Note that if, while crawling,
 you bump into a link that redirects you to another domain, Nutch will
 consider the domain you are redirected to as valid.

 On Fri, Mar 13, 2009 at 10:59 AM, MyD myd.ro...@googlemail.com wrote:
 
  Hi @ all,
 
  Is it possible to limit Nutch's crawling process to the seed URLs? E.g. I
  have 1000 seed URLs and I want to crawl just these domains. Thanks in
  advance.
 
  Regards,
  MyD
  --
  View this message in context:
 http://www.nabble.com/Limit-Nutch-Crawl-to-Seed-URLs-tp22493314p22493314.html
  Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Limit Nutch Crawl to Seed URLs

2009-03-14 Thread yanky young

The domain URL filter seems to be in 1.0; maybe you can just check out the
plugin code from the 1.0 trunk and build it into your 0.9 code base.

good luck

yanky

2009/3/14 MyD myd.ro...@googlemail.com


 Where can I find the domain urlfilter? I'm using the branch 0.9...

 Cheers,
 Markus


 Dennis Kubes-2 wrote:
 
  There is a domain-urlfilter that should help do what you are looking for.
 
  Dennis
 
  MyD wrote:
  Hi @ all,
 
  Is it possible to limit Nutch's crawling process to the seed URLs? E.g. I
  have 1000 seed URLs and I want to crawl just these domains. Thanks in
  advance.
 
  Regards,
  MyD
 
 

 --
 View this message in context:
 http://www.nabble.com/Limit-Nutch-Crawl-to-Seed-URLs-tp22493314p22509551.html
 Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Limit Nutch Crawl to Seed URLs

2009-03-13 Thread Stevan Kovacevic

Hi,
You can avoid going to other domains by editing the urlfilter file,
but this is not too practical when you have a lot of seed URLs, which
you do. The nutch-default.xml file has a property,
db.ignore.external.links, which is set to false by default. Set this to
true and you will only crawl the seed URL domains. This file is located in
the conf folder, in case you don't know. Note that if, while crawling,
you bump into a link that redirects you to another domain, Nutch will
consider the domain you are redirected to as valid.
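
For a long seed list, the urlfilter rules do not have to be written by hand. A minimal sketch in Python, assuming a list of seed URLs that include their scheme and the `+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/` rule style of the stock urlfilter files; the function name `seed_filter_rules` is hypothetical and not part of Nutch:

```python
from urllib.parse import urlparse


def seed_filter_rules(seed_urls):
    """Build urlfilter lines that accept only the hosts found in the seeds."""
    hosts = []
    for url in seed_urls:
        # Seeds without a scheme yield no hostname and are skipped.
        host = urlparse(url.strip()).hostname
        if host and host not in hosts:
            hosts.append(host)
    prefix = r"+^https?://([a-z0-9]*\.)*"
    # Escape the dots so they match literally in the regex.
    rules = [prefix + host.replace(".", r"\.") + "/" for host in hosts]
    rules.append("-.")  # reject everything no earlier rule accepted
    return rules
```

The returned lines would take the place of the accept rules in regex-urlfilter.txt or crawl-urlfilter.txt; the final `-.` rejects every URL that no earlier rule accepted.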

On Fri, Mar 13, 2009 at 10:59 AM, MyD myd.ro...@googlemail.com wrote:

 Hi @ all,

 Is it possible to limit Nutch's crawling process to the seed URLs? E.g. I
 have 1000 seed URLs and I want to crawl just these domains. Thanks in
 advance.

 Regards,
 MyD

Re: Limit Nutch Crawl to Seed URLs

2009-03-13 Thread Jack Yu

Good point. I only used a long urlfilter, a long time ago.

On Fri, Mar 13, 2009 at 9:19 PM, Stevan Kovacevic skovacevi...@gmail.com wrote:

 Hi,
 You can avoid going to other domains by editing the urlfilter file,
 but this is not too practical when you have a lot of seed URLs, which
 you do. The nutch-default.xml file has a property,
 db.ignore.external.links, which is set to false by default. Set this to
 true and you will only crawl the seed URL domains. This file is located in
 the conf folder, in case you don't know. Note that if, while crawling,
 you bump into a link that redirects you to another domain, Nutch will
 consider the domain you are redirected to as valid.

 On Fri, Mar 13, 2009 at 10:59 AM, MyD myd.ro...@googlemail.com wrote:
 
  Hi @ all,
 
  Is it possible to limit Nutch's crawling process to the seed URLs? E.g. I
  have 1000 seed URLs and I want to crawl just these domains. Thanks in
  advance.
 
  Regards,
  MyD

Re: Limit Nutch Crawl to Seed URLs

2009-03-13 Thread Dennis Kubes


There is a domain-urlfilter that should help do what you are looking for.

Dennis
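
A minimal sketch of preparing input for such a filter, assuming the plugin reads a plain-text file with one domain name or host name per line (domain-urlfilter.txt in the 1.0 distribution); the function name `domain_filter_lines` is hypothetical and not part of Nutch:

```python
from urllib.parse import urlparse


def domain_filter_lines(seed_urls):
    """Return the sorted, de-duplicated host names found in the seed URLs."""
    hosts = {urlparse(url.strip()).hostname for url in seed_urls}
    hosts.discard(None)  # entries that did not parse as URLs with a host
    return sorted(hosts)
```

Writing the returned names to the filter file, one per line, would cover all 1000 seeds without any regular expressions.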

MyD wrote:

Hi @ all,

Is it possible to limit Nutch's crawling process to the seed URLs? E.g. I
have 1000 seed URLs and I want to crawl just these domains. Thanks in
advance.

Regards,
MyD

Re: Limit Nutch Crawl to Seed URLs

2009-03-13 Thread MyD


Where can I find the domain urlfilter? I'm using the branch 0.9...

Cheers,
Markus


Dennis Kubes-2 wrote:
 
 There is a domain-urlfilter that should help do what you are looking for.
 
 Dennis
 
 MyD wrote:
 Hi @ all,
 
 Is it possible to limit Nutch's crawling process to the seed URLs? E.g. I
 have 1000 seed URLs and I want to crawl just these domains. Thanks in
 advance.
 
 Regards,
 MyD
 
 

