Hi,

On 6/4/07, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Does the 0.9 crawl-delay implementation actually permit multiple threads
> to access a site simultaneously?

AFAIK, yes. The option fetcher.threads.per.host should be greater than 1
_only_ when you are accessing a site under your control. All of Nutch's
politeness policies are pretty much ignored when
fetcher.threads.per.host is greater than 1.

Fetcher2 completely ignores Nutch's server-delay and the site's
crawl-delay value if maxThreads > 1, and uses another min.crawl.delay
value when accessing the site. I am not sure about Fetcher, but I think
it allows up to maxThreads fetchers to access the site simultaneously
and then blocks the next one.

There may be a better explanation in this post to nutch-dev: "Fetcher2's
delay between successive requests".

>
> Doug
>
> -------- Original Message --------
> Subject: Nutch 0.9 and Crawl-Delay
> Date: Sun, 3 Jun 2007 10:50:24 +0200
> From: Lutz Zetzsche <[EMAIL PROTECTED]>
> Reply-To: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
>
> Dear Nutch developers,
>
> I have had problems with a Nutch based robot during the last 12 hours,
> which I have now solved by banning this particular bot from my server
> (not Nutch completely for the moment). The ilial bot, which created
> considerable load on my server, was using the latest Nutch version -
> v0.9 - which is now also supporting the crawl-delay directive in the
> robots.txt.
>
> The bot seems to have obeyed the directive - crawl-delay: 10 - as it
> visited my website every 15 seconds, which would have been ok, BUT it
> then submitted FIVE requests at once (see example log extract below)! 5
> requests at once every 15 seconds is not acceptable on my server, which
> is principally serving dynamic content and is often visited by up to 10
> search engines at the same time, altogether surely creating 99.9% of
> the server traffic.
>
> So my suggestion is that Nutch only submits one request each time, when
> it detects a crawl-delay directive in the robots.txt. This is the
> behaviour, the MSNbot shows for example.
> The MSNbot also liked to
> submit several requests at once every few seconds, until I added the
> crawl-delay directive to my robots.txt.
>
>
> Best wishes
>
> Lutz Zetzsche
> http://www.sea-rescue.de/
>
>
>
> 72.44.58.191 - - [03/Jun/2007:04:40:53 +0200]
> "GET /english/Photos+%26+Videos/PV/ HTTP/1.0" 200
> 13661 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet
> startup company. For more information please visit
> http://www.ilial.com/crawler; http://www.ilial.com/crawler;
> [EMAIL PROTECTED])"
> 72.44.58.191 - - [03/Jun/2007:04:40:53 +0200]
> "GET /english/Links/WRGL/Countries/ HTTP/1.0" 200
> 15048 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet
> startup company. For more information please visit
> http://www.ilial.com/crawler; http://www.ilial.com/crawler;
> [EMAIL PROTECTED])"
> 72.44.58.191 - - [03/Jun/2007:04:40:53 +0200]
> "GET /islenska/Hlekkir/Brede-ger%C3%B0%20%2F%2033%20fet/ HTTP/1.0" 200
> 60041 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet
> startup company. For more information please visit
> http://www.ilial.com/crawler; http://www.ilial.com/crawler;
> [EMAIL PROTECTED])"
> 66.249.72.244 - - [03/Jun/2007:04:40:55 +0200]
> "GET /francais/Liens/Philip+Vaux/Brede%20%2F%2033%20pieds/ HTTP/1.1" 200
> 17568 "-" "Mozilla/5.0 (compatible; Googlebot/2.1;
> +http://www.google.com/bot.html)"
> 66.231.189.119 - - [03/Jun/2007:04:40:55 +0200]
> "GET /english/Links/Martijn%20Koenraad%20Hof/Netherlands%20Antilles/Sint%20Maarten/
> HTTP/1.0" 200 17193 "-" "Gigabot/2.0
> (http://www.gigablast.com/spider.html)"
> 74.6.86.105 - - [03/Jun/2007:04:40:56 +0200]
> "GET /dansk/Links/Hermann+Apelt/ HTTP/1.0" 200
> 30496 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp;
> http://help.yahoo.com/help/us/ysearch/slurp)"
> 72.44.58.191 - - [03/Jun/2007:04:40:53 +0200]
> "GET /italiano/Links/Giamaica/MRCCs+%26+Stazioni+radio+costiera/ HTTP/1.0" 200
> 16658 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet
> startup company. For more information please visit
> http://www.ilial.com/crawler; http://www.ilial.com/crawler;
> [EMAIL PROTECTED])"
> 72.44.58.191 - - [03/Jun/2007:04:40:53 +0200]
> "GET /english/Links/Mauritius/Countries/Organisations/ HTTP/1.0" 200
> 15624 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet
> startup company. For more information please visit
> http://www.ilial.com/crawler; http://www.ilial.com/crawler;
> [EMAIL PROTECTED])"
>

--
Doğacan Güney

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
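
To make the behaviour described in the reply concrete, here is a toy model in Python of how request start times to a single host might look. It is only a sketch based on the description above, not Nutch's actual code: the function name, its parameters, and the fixed fetch_time are all assumptions made for illustration. It assumes that with one thread per host the crawl delay is applied after each fetch, and that with more than one thread only a (typically much smaller) minimum delay is applied.

```python
def request_start_times(n_requests, threads_per_host, crawl_delay,
                        min_delay=0.0, fetch_time=1.0):
    """Return the start time in seconds of each request to one host.

    Hypothetical model: every fetch takes `fetch_time` seconds, and a
    thread waits for the applicable delay after its fetch finishes.
    """
    # Assumption taken from the reply: the crawl delay only applies when a
    # single thread accesses the host; otherwise the minimum delay is used.
    delay = crawl_delay if threads_per_host == 1 else min_delay

    # Earliest time at which each thread may start its next request.
    thread_free_at = [0.0] * threads_per_host
    starts = []
    for _ in range(n_requests):
        # Use the thread that becomes available first.
        i = min(range(threads_per_host), key=thread_free_at.__getitem__)
        start = thread_free_at[i]
        starts.append(start)
        thread_free_at[i] = start + fetch_time + delay
    return starts


# One thread per host: requests are spaced by fetch time plus crawl delay.
print(request_start_times(3, threads_per_host=1, crawl_delay=10.0))

# Five threads per host: the first five requests all start together.
print(request_start_times(6, threads_per_host=5, crawl_delay=10.0))
```

In this model, five threads per host produce five simultaneous requests regardless of the crawl delay, which resembles the burst in the log extract above; with one thread per host, the requests are spread out by the crawl delay.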