Hi,

it appears that Nutch doesn't obey the "Crawl-delay:" robots.txt
directive. Our robots.txt defines a crawl-delay of 30, and most robots
seem to obey it, unlike this Nutch instance from tonight:

209.235.6.4 wikipedia.7val.com - - [30/May/2006:06:34:40 +0200] "GET 
/w/index.php?title=Category:178_births&from=R/7val-fit-sid=ecb6f5bd55541ca2a7be6c12ff597620
 HTTP/1.0" 200 8537 "-" "Nokia6620/2.0 (4.22.1) SymbianOS/7.0s Series60/2.1 
Profile/MIDP-2.0 Configuration/CLDC-1.0/0.7.2 (Nutch; 
http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)" 
pid:21925
209.235.6.4 wikipedia.7val.com - - [30/May/2006:06:34:40 +0200] "GET 
/w/index.php?title=Category:1681_births&from=A/7val-fit-sid=4600da9ec83ea98b83f398b550c73720
 HTTP/1.0" 200 12338 "-" "Nokia6620/2.0 (4.22.1) SymbianOS/7.0s Series60/2.1 
Profile/MIDP-2.0 Configuration/CLDC-1.0/0.7.2 (Nutch; 
http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)" 
pid:21926
209.235.6.4 wikipedia.7val.com - - [30/May/2006:06:34:40 +0200] "GET 
/w/index.php?title=Category:1654_deaths&from=S/7val-fit-sid=7ef0086dd404bda46cae6effe8cee010
 HTTP/1.0" 200 10457 "-" "Nokia6620/2.0 (4.22.1) SymbianOS/7.0s Series60/2.1 
Profile/MIDP-2.0 Configuration/CLDC-1.0/0.7.2 (Nutch; 
http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)" 
pid:21927
209.235.6.4 wikipedia.7val.com - - [30/May/2006:06:34:40 +0200] "GET 
/w/index.php?title=Category:1702_births&from=W/7val-fit-sid=bdb93afe3a197c01d6472a54fcc6c220
 HTTP/1.0" 200 8851 "-" "Nokia6620/2.0 (4.22.1) SymbianOS/7.0s Series60/2.1 
Profile/MIDP-2.0 Configuration/CLDC-1.0/0.7.2 (Nutch; 
http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)" 
pid:21921
209.235.6.4 wikipedia.7val.com - - [30/May/2006:06:34:40 +0200] "GET 
/w/index.php?title=Category:1674_births&from=K/7val-fit-sid=86f70bf8064640d3a02bda22f2827610
 HTTP/1.0" 200 12428 "-" "Nokia6620/2.0 (4.22.1) SymbianOS/7.0s Series60/2.1 
Profile/MIDP-2.0 Configuration/CLDC-1.0/0.7.2 (Nutch; 
http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)" 
pid:21920

Do current versions of Nutch support Crawl-delay, or could you add this
to future versions?
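
As a rough illustration of the behaviour expected here, the following
Python sketch reads a Crawl-delay value and counts how many consecutive
requests arrive closer together than that delay. It is only a sketch
under assumptions: the robots.txt body and the helper names
crawl_delay_for and violations are hypothetical, and it says nothing
about how Nutch itself is implemented. Only the delay of 30 and the
request timestamp are taken from this message.

```python
from datetime import datetime
from urllib.robotparser import RobotFileParser

# Assumed robots.txt body; only the delay value of 30 comes from this message.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 30
"""


def crawl_delay_for(robots_txt, user_agent, default=0.0):
    """Return the Crawl-delay in seconds that applies to user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None if no delay is declared
    return float(delay) if delay is not None else default


def violations(timestamps, delay):
    """Count consecutive requests spaced less than delay seconds apart."""
    fmt = "%d/%b/%Y:%H:%M:%S %z"  # timestamp format used in the access log
    times = sorted(datetime.strptime(t, fmt) for t in timestamps)
    return sum(
        1 for a, b in zip(times, times[1:])
        if (b - a).total_seconds() < delay
    )


delay = crawl_delay_for(ROBOTS_TXT, "Nutch")
# The five requests quoted above all carry the same timestamp.
hits = ["30/May/2006:06:34:40 +0200"] * 5
print(delay, violations(hits, delay))
```

A crawler honouring the directive would wait at least that many seconds
between requests to the same host, so the count would be zero.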

regards,


rainer canavan
-- 
Rainer Canavan
Head Of System Administration

Sevenval AG
Bahnhofsvorplatz 1
50667 Köln

Phone  +49 221 6500789
Fax    +49 221 6500788
Mobile +49 162 2048089
