Dear Nutch Project Gurus,
I'm the webmaster of http://swisspig.net/, and I have noticed periodic
access by the Nutch crawler at U Washington. However, today's access
was strange, in that it attempted to crawl a *portion* of a URL
(which of course is not a link in itself). This might be a bug in the
crawler, or a bug in a modification made by the UW folks. The relevant
log snippets are:
128.208.6.200 - - [11/Jun/2006:18:27:27 -0400] "GET /robots.txt HTTP/1.0" 200 262 "" "NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; [EMAIL PROTECTED])"
128.208.6.200 - - [11/Jun/2006:18:27:28 -0400] "GET /post.php HTTP/1.0" 200 25000 "" "NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; [EMAIL PROTECTED])"
128.208.6.200 - - [11/Jun/2006:18:27:33 -0400] "GET / HTTP/1.0" 200 25000 "" "NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; [EMAIL PROTECTED])"
128.208.6.200 - - [11/Jun/2006:18:27:38 -0400] "GET /r/post/ HTTP/1.0" 200 25000 "" "NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; [EMAIL PROTECTED])"
Please note that http://swisspig.net/post.php and
http://swisspig.net/r/post/ are scripts (the same script actually -- I
recently migrated from the format "/post.php?id=foo" to "/r/post/foo")
that are not meant to be accessed directly. There are of course no
links from http://swisspig.net/ to these URLs.
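For anyone who wants to check a log for the same pattern, the following is a minimal sketch in Python, not a definitive tool. It assumes the Apache combined log format with one entry per line (unwrapped, unlike most emailed excerpts), and it assumes that "accessed directly" means a request for the bare script path with no id. The names LOG_PATTERN, BARE_SCRIPT_PATHS, and find_bare_script_hits are illustrative and are not part of Nutch.

```python
import re

# Assumption: Apache combined log format, one complete entry per line.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Assumption: these are the script entry points that should never be
# requested on their own (taken from the two URLs named in the report).
BARE_SCRIPT_PATHS = {"/post.php", "/r/post/"}


def find_bare_script_hits(lines, agent_substring="Nutch"):
    """Return (timestamp, path) pairs for matching crawler requests.

    A line is reported when its user agent contains `agent_substring`
    and its request path is exactly one of BARE_SCRIPT_PATHS.
    Lines that do not parse are skipped silently.
    """
    hits = []
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue
        if agent_substring not in match.group("agent"):
            continue
        if match.group("path") in BARE_SCRIPT_PATHS:
            hits.append((match.group("time"), match.group("path")))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_hits.py < access.log
    for timestamp, path in find_bare_script_hits(sys.stdin):
        print(timestamp, path)
```

Matching on the exact path keeps legitimate requests such as "/r/post/foo" out of the report; a prefix match would flag those as well.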
Regards,
Brian Ziman
webmaster, swisspig.net