Re: Limit Nutch Crawl to Seed URLs
Hi Stevan,

I am using the db.ignore.external.links property to limit the crawl to a domain, and I am getting a whole bunch of URLs from other domains as well. I suppose they are URLs redirected from seed domain URLs. When I tried crawling with filter settings in the regex-urlfilter.txt and crawl-urlfilter.txt files, I didn't see these extra URLs, and I also found that more URLs from the seed domain were crawled.

For my automated crawling I need to use the db.ignore.external.links property, but I am concerned that it also covers fewer URLs from the seed domain. Is there a way to fix this? I don't set TopN in my implementation.

Thanks and Regards,
Neera

On Fri, Mar 13, 2009 at 6:19 AM, Stevan Kovacevic skovacevi...@gmail.com wrote:

Hi, you can avoid going to other domains by editing the urlfilter file, but this is not too practical when you have a lot of seed URLs, which you do. The nutch-default.xml file has a property, db.ignore.external.links, which is set to false by default. Set this to true and you will only crawl the seed URL domains. This file is located in the conf folder, in case you don't know. Note that if, while crawling, you bump into a link that redirects you to another domain, Nutch will consider the domain you are redirected to as valid.

On Fri, Mar 13, 2009 at 10:59 AM, MyD myd.ro...@googlemail.com wrote:

Hi @ all, is it possible to limit Nutch's crawling process to the seed URLs? E.g. I have 1000 seed URLs and I want to crawl just these domains. Thanks in advance.

Regards,
MyD

--
View this message in context: http://www.nabble.com/Limit-Nutch-Crawl-to-Seed-URLs-tp22493314p22493314.html
Sent from the Nutch - User mailing list archive at Nabble.com.
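One way to check the two effects described above (URLs from other domains showing up, and fewer seed-domain URLs being covered) is to compare the hosts of the crawled URLs against the hosts of the seeds. The sketch below is not part of Nutch. It assumes the crawled URLs have already been exported to a plain list by some means, and it assumes that db.ignore.external.links compares host names rather than whole domains, in which case links from www.example.com to news.example.com would count as external. The function names and example URLs are hypothetical.

```python
from urllib.parse import urlparse


def host_of(url):
    """Lower-cased host name of a URL, or '' if it has none."""
    return (urlparse(url.strip()).hostname or "").lower()


def split_by_seed_host(seed_urls, crawled_urls):
    """Partition crawled URLs into (on a seed host, off every seed host)."""
    # Assumption: "internal" means "same host as one of the seeds".
    allowed = {host_of(u) for u in seed_urls} - {""}
    inside, outside = [], []
    for url in crawled_urls:
        (inside if host_of(url) in allowed else outside).append(url)
    return inside, outside


if __name__ == "__main__":
    # Hypothetical data for illustration only.
    seeds = ["http://www.example.com/"]
    crawled = [
        "http://www.example.com/about.html",
        "http://news.example.com/today.html",   # same domain, different host
        "http://other.example.net/landing",     # e.g. reached via a redirect
    ]
    inside, outside = split_by_seed_host(seeds, crawled)
    print("on seed hosts:", inside)
    print("off seed hosts:", outside)
```

URLs in the second list that belong to the seed's domain but a different host would point to host-level matching as the reason for the lower coverage. URLs from unrelated domains would be consistent with the redirect behaviour Stevan mentions.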
Re: Limit Nutch Crawl to Seed URLs
The domain URL filter seems to be in 1.0. Maybe you can just check out this plugin's code from the 1.0 trunk and build it into your 0.9 code base.

Good luck,
yanky

2009/3/14 MyD myd.ro...@googlemail.com

Where can I find the domain urlfilter? I'm using the branch 0.9...

Cheers,
Markus

Dennis Kubes-2 wrote:

There is a domain-urlfilter that should help do what you are looking for.

Dennis

MyD wrote:

Hi @ all, is it possible to limit Nutch's crawling process to the seed URLs? E.g. I have 1000 seed URLs and I want to crawl just these domains. Thanks in advance.

Regards,
MyD

--
View this message in context: http://www.nabble.com/Limit-Nutch-Crawl-to-Seed-URLs-tp22493314p22509551.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Limit Nutch Crawl to Seed URLs
Hi, you can avoid going to other domains by editing the urlfilter file, but this is not too practical when you have a lot of seed URLs, which you do. The nutch-default.xml file has a property, db.ignore.external.links, which is set to false by default. Set this to true and you will only crawl the seed URL domains. This file is located in the conf folder, in case you don't know. Note that if, while crawling, you bump into a link that redirects you to another domain, Nutch will consider the domain you are redirected to as valid.

On Fri, Mar 13, 2009 at 10:59 AM, MyD myd.ro...@googlemail.com wrote:

Hi @ all, is it possible to limit Nutch's crawling process to the seed URLs? E.g. I have 1000 seed URLs and I want to crawl just these domains. Thanks in advance.

Regards,
MyD
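Editing the urlfilter file by hand is what becomes impractical with 1000 seeds, but the rules themselves can be generated. The following is a minimal sketch, not a Nutch tool: it assumes the usual regex-urlfilter.txt conventions (one rule per line, a leading + to accept or - to reject, first matching rule wins) and builds one accept rule per seed host followed by a catch-all reject. The function names and example seeds are hypothetical, and the generated lines would still have to be merged into the existing filter file by hand or by another script.

```python
import re
from urllib.parse import urlparse


def seed_hosts(seed_urls):
    """Return the unique host names found in the seed URLs, in input order."""
    seen, hosts = set(), []
    for url in seed_urls:
        host = (urlparse(url.strip()).hostname or "").lower()
        if host and host not in seen:
            seen.add(host)
            hosts.append(host)
    return hosts


def filter_rules(seed_urls):
    """Build one '+' accept rule per seed host, then a final reject-all rule."""
    rules = [
        # Accept http/https on exactly this host, with an optional port.
        "+^https?://" + re.escape(host) + "(:[0-9]+)?(/|$)"
        for host in seed_hosts(seed_urls)
    ]
    rules.append("-.")  # reject every URL that no rule above accepted
    return rules


if __name__ == "__main__":
    # Hypothetical seeds for illustration only.
    seeds = ["http://www.example.com/", "http://news.example.org/index.html"]
    print("\n".join(filter_rules(seeds)))
```

Matching on the exact host keeps the rules strict. To accept every subdomain of a seed's domain instead, the host part of each pattern would need to be loosened, which is a design choice that depends on how broad the crawl should be.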
Re: Limit Nutch Crawl to Seed URLs
Good point. I only used a long urlfilter a long time ago.

On Fri, Mar 13, 2009 at 9:19 PM, Stevan Kovacevic skovacevi...@gmail.com wrote:

Hi, you can avoid going to other domains by editing the urlfilter file, but this is not too practical when you have a lot of seed URLs, which you do. The nutch-default.xml file has a property, db.ignore.external.links, which is set to false by default. Set this to true and you will only crawl the seed URL domains. This file is located in the conf folder, in case you don't know. Note that if, while crawling, you bump into a link that redirects you to another domain, Nutch will consider the domain you are redirected to as valid.

On Fri, Mar 13, 2009 at 10:59 AM, MyD myd.ro...@googlemail.com wrote:

Hi @ all, is it possible to limit Nutch's crawling process to the seed URLs? E.g. I have 1000 seed URLs and I want to crawl just these domains. Thanks in advance.

Regards,
MyD
Re: Limit Nutch Crawl to Seed URLs
There is a domain-urlfilter that should help do what you are looking for.

Dennis

MyD wrote:

Hi @ all, is it possible to limit Nutch's crawling process to the seed URLs? E.g. I have 1000 seed URLs and I want to crawl just these domains. Thanks in advance.

Regards,
MyD
Re: Limit Nutch Crawl to Seed URLs
Where can I find the domain urlfilter? I'm using the branch 0.9...

Cheers,
Markus

Dennis Kubes-2 wrote:

There is a domain-urlfilter that should help do what you are looking for.

Dennis

MyD wrote:

Hi @ all, is it possible to limit Nutch's crawling process to the seed URLs? E.g. I have 1000 seed URLs and I want to crawl just these domains. Thanks in advance.

Regards,
MyD