Hi Lewis,

I believe that you can find the robots.txt of the site here:
http://www.kinoundco.de/robots.txt

I think he followed the instructions at http://lucene.apache.org/nutch/bot.html 
correctly (by the way, this outdated URL is still in HttpBase.java).
My guess is that the guys at pixray.com have configured their own User-Agent 
string and that they might not have configured http.robots.agents properly.
I must say that this setup is indeed a bit error-prone.
Is there a reason why we don't add http.agent.name to http.robots.agents by 
default? I think we could make an effort here to make the default Nutch 
installation more robust with respect to politeness.
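
To make the suggestion concrete, here is a rough sketch in Python. It is not 
Nutch code: the function name effective_robots_agents is made up for 
illustration, and I am assuming that http.robots.agents is a comma-separated 
list in decreasing order of precedence. The idea is simply that the value of 
http.agent.name always ends up first in the list used for robots.txt matching.

```python
def effective_robots_agents(agent_name, robots_agents="*"):
    """Return the robots agent list with agent_name guaranteed to come first.

    Hypothetical helper, not part of Nutch. agent_name stands for the value of
    http.agent.name, robots_agents for the raw http.robots.agents string.
    """
    name = agent_name.strip()
    if not name:
        raise ValueError("http.agent.name must not be empty")

    # Split the comma-separated property and drop empty entries.
    configured = [a.strip() for a in robots_agents.split(",") if a.strip()]

    # Remove any existing copy of the agent name (case-insensitive) so that
    # it appears exactly once, at the front of the list.
    rest = [a for a in configured if a.lower() != name.lower()]
    return [name] + rest
```

With a made-up agent name "MyCrawler" and robots agents left at "*", this 
returns ["MyCrawler", "*"], so a robots.txt rule addressed to the crawler's 
own name would be matched even if nobody touched http.robots.agents.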

Mathijs
 

On Nov 17, 2011, at 2:52 , Lewis John Mcgibbney wrote:

> Hi Maximilian,
> 
> What we're missing is the robots.txt itself, i.e. how are you trying to ban 
> Nutch? I've been in touch with the guys at traffic server about your issue 
> to see if they have suggestions that don't involve totally banning all Nutch 
> instances from contacting your webserver.
> 
> To all devs, the other thing that strikes me as odd is the User-Agent 
> string. Is this really how Nutch identifies itself?
> 
> Thanks
> 
> Lewis
> 
> 2011/11/16 Maximilian Laurenz <[email protected]>
> All requests seem to come from a German company called http://www.pixray.com, 
> which obviously ignores the robots.txt with their version of the Nutch 
> crawler. We informed them and will ban their IP range if they don’t stop 
> scanning us with invalid requests.
> 
>  
> 
> Sincerely,
> 
> Maximilian Laurenz
> S&L Medien Gruppe GmbH
> Aidenbachstraße 54
> 81379 München
> Tel. +49 89 790862-49
> Fax +49 89 790862-55
> [email protected]
> http://www.slmedien.de
> 
> S&L Medien Gruppe GmbH | Management: Maria-Theresia von Seidlein, 
> Torsten Weihrich, Olaf Wiehler | Registered office: München | Local court 
> (Amtsgericht) München | HRB 99977
> 
>  
> 
>  
> 
> From: Maximilian Laurenz 
> Sent: Wednesday, November 2, 2011 14:14
> To: '[email protected]'
> Subject: Nutch ignores robots.txt
> 
>  
> 
> Hi there,
> 
> Because a Nutch client seems to cause errors on our web server, we changed 
> robots.txt for www.kinoundco.de to disallow Nutch. Unfortunately we still get 
> requests:
> 
>  
> 
> 2011-11-01 05:50:35 W3SVC4 MB 62.128.28.16 GET 
> /Rango/+(Math.random()*100000)+ - 80 - 188.40.65.130 HTTP/1.0 
> Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:2.0.1)+Gecko/20100101++++Firefox/4.0.1+++/Nutch-1.2
>  - - www.kinoundco.de 302 0 373 263 0
> 
> 2011-11-01 05:52:23 W3SVC4 MB 62.128.28.16 GET /default.aspx 
> aspxerrorpath=/Atemlos-Gefaehrliche-Wahrheit/+(Math.random()*100000)+ 80 - 
> 188.40.65.130 HTTP/1.0 
> Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:2.0.1)+Gecko/20100101++++Firefox/4.0.1+++/Nutch-1.2
>  - - www.kinoundco.de 200 0 32449 315 15
> 
> 2011-11-01 05:59:15 W3SVC4 MB 62.128.28.16 GET 
> /Kleine-wahre-Luegen/+(Math.random()*100000)+ - 80 - 188.40.65.130 HTTP/1.0 
> Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:2.0.1)+Gecko/20100101++++Firefox/4.0.1+++/Nutch-1.2
>  - - www.kinoundco.de 302 0 401 277 15
> 
> 2011-11-01 05:59:31 W3SVC4 MB 62.128.28.16 GET 
> /Nachtasyl/+(Math.random()*100000)+ - 80 - 188.40.65.130 HTTP/1.0 
> Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:2.0.1)+Gecko/20100101++++Firefox/4.0.1+++/Nutch-1.2
>  - - www.kinoundco.de 302 0 381 267 0
> 
> 2011-11-01 06:35:30 W3SVC4 MB 62.128.28.16 GET 
> /Auf-der-anderen-Seite-der-Leinwand-100-Jahre-Moviemento/+(Math.random()*100000)+
>  - 80 - 78.46.90.27 HTTP/1.0 
> Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:2.0.1)+Gecko/20100101++++Firefox/4.0.1+++/Nutch-1.2
>  - - www.kinoundco.de 302 0 473 313 15
> 
> 2011-11-01 06:35:33 W3SVC4 MB 62.128.28.16 GET 
> /Zoowaerter/+(Math.random()*100000)+ - 80 - 78.46.90.27 HTTP/1.0 
> Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:2.0.1)+Gecko/20100101++++Firefox/4.0.1+++/Nutch-1.2
>  - - www.kinoundco.de 302 0 383 268 31
> 
> 2011-11-01 06:35:40 W3SVC4 MB 62.128.28.16 GET /default.aspx 
> aspxerrorpath=/Sascha/+(Math.random()*100000)+ 80 - 78.46.90.27 HTTP/1.0 
> Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:2.0.1)+Gecko/20100101++++Firefox/4.0.1+++/Nutch-1.2
>  - - www.kinoundco.de 200 0 32449 292 62
> 
> 2011-11-01 06:36:35 W3SVC4 MB 62.128.28.16 GET 
> /Beschissenheit-der-Dinge/+(Math.random()*100000)+ - 80 - 78.46.90.27 
> HTTP/1.0 
> Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:2.0.1)+Gecko/20100101++++Firefox/4.0.1+++/Nutch-1.2
>  - - www.kinoundco.de 302 0 411 282 31
> 
> 2011-11-01 06:38:14 W3SVC4 MB 62.128.28.16 GET 
> /Auf-der-Suche/+(Math.random()*100000)+ - 80 - 78.46.90.27 HTTP/1.0 
> Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:2.0.1)+Gecko/20100101++++Firefox/4.0.1+++/Nutch-1.2
>  - - www.kinoundco.de 302 0 389 271 15
> 
> 2011-11-01 06:39:55 W3SVC4 MB 62.128.28.16 GET /Fall/+(Math.random()*100000)+ 
> - 80 - 78.46.90.27 HTTP/1.0 
> Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:2.0.1)+Gecko/20100101++++Firefox/4.0.1+++/Nutch-1.2
>  - - www.kinoundco.de 302 0 371 262 15
> 
> 2011-11-01 07:51:10 W3SVC4 MB 62.128.28.16 GET 
> /Midnight-in-Paris/+(Math.random()*100000)+ - 80 - 176.9.26.236 HTTP/1.0 
> Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:2.0.1)+Gecko/20100101++++Firefox/4.0.1+++/Nutch-1.2
>  - - www.kinoundco.de 302 0 397 275 0
> 
> 2011-11-01 07:51:40 W3SVC4 MB 62.128.28.16 GET 
> /Betty-Anne-Waters/+(Math.random()*100000)+ - 80 - 176.9.26.236 HTTP/1.0 
> Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:2.0.1)+Gecko/20100101++++Firefox/4.0.1+++/Nutch-1.2
>  - - www.kinoundco.de 302 0 397 275 15
> 
>  
> 
> Sincerely,
> 
> Max
> 
> -- 
> Lewis 
> 
