Does anyone on the list know whether Nutch 0.5 can be configured to ignore robots.txt, or whether there is a bug in its robots.txt handling?

In particular, are the folks behind 13.1.101.37 (jumanji.parc.xerox.com) reading this list? Can they look into this? Their Nutch-based crawl is annoying someone! They should also consider changing the agent name and contact address in their Nutch configuration so that folks contact them directly in the future.

Thanks,

Doug

-------- Original Message --------
Subject: [Nutch-admin] Re: Auto-response for your message to [EMAIL PROTECTED], [EMAIL PROTECTED]
Date: Wed, 29 Sep 2004 13:07:52 -0700 (PDT)
From: John Young <[EMAIL PROTECTED]>


Your message to the Nutch fetcher agent has been received.

The Nutch fetcher obeys the robots exclusion standard, so if you wish
to alter how Nutch accesses your site, please visit
http://www.robotstxt.org/.

For more information about the Nutch project, please visit
http://www.nutch.org/.

Thanks!

Nutch


Wrong answer.  Your bot is fetching pages from a subdirectory
on our site which is listed in our robots.txt.  Other bots do
not fetch pages from that directory.

I am trying to help.  If you disregard help for robots.txt
violations, sites will block you.  I am not blocking you, yet.

Perhaps you should reevaluate your auto-responders rule set
to avoid sending out messages like the one above.

Again, from robots.txt:

User-agent: *
Disallow: /games/F
Disallow: /games/O
Disallow: /games/Q
Disallow: /games/special
Disallow: /store/F
Disallow: /store/O
Disallow: /store/Q

A sample of your bot's recent activity:

13.1.101.37 - - [29/Sep/2004:03:30:10 -0700] "GET /store/O/cart.html?ax=refresh&oi=1032542 HTTP/1.0" 302 119 "-" "NutchCVS/0.05-dev (Nutch; http://www.nutch.org/docs/en/bot.html; [EMAIL PROTECTED])"
13.1.101.37 - - [29/Sep/2004:03:30:11 -0700] "GET /store/O/cart.html HTTP/1.0" 302 119 "-" "NutchCVS/0.05-dev (Nutch; http://www.nutch.org/docs/en/bot.html; [EMAIL PROTECTED])"
13.1.101.37 - - [29/Sep/2004:03:30:12 -0700] "GET /store/O/cart.html HTTP/1.0" 302 119 "-" "NutchCVS/0.05-dev (Nutch; http://www.nutch.org/docs/en/bot.html; [EMAIL PROTECTED])"
13.1.101.37 - - [29/Sep/2004:03:30:13 -0700] "GET /store/O/cart.html HTTP/1.0" 302 119 "-" "NutchCVS/0.05-dev (Nutch; http://www.nutch.org/docs/en/bot.html; [EMAIL PROTECTED])"
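
For anyone who wants to reproduce the complaint, here is a rough sketch that checks the logged paths against the rules quoted above. It uses Python's standard urllib.robotparser, not Nutch's own (Java) robots.txt code, so it only shows what a compliant prefix-matching parser would decide. The agent token "NutchCVS" is taken from the log lines; the helper names build_parser and is_allowed are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt excerpt quoted in the complaint.
ROBOTS_TXT = """\
User-agent: *
Disallow: /games/F
Disallow: /games/O
Disallow: /games/Q
Disallow: /games/special
Disallow: /store/F
Disallow: /store/O
Disallow: /store/Q
"""

# Request paths taken from the access-log sample.
LOGGED_PATHS = [
    "/store/O/cart.html?ax=refresh&oi=1032542",
    "/store/O/cart.html",
]


def build_parser(robots_txt):
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, agent, path):
    """Return True if the rules permit this agent to fetch the path."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for path in LOGGED_PATHS:
        verdict = "allowed" if is_allowed(parser, "NutchCVS", path) else "disallowed"
        print(f"{verdict}: {path}")
```

Both logged paths begin with /store/O, so a parser that applies the Disallow lines as simple path prefixes should refuse them. If the crawler fetched them anyway, either the rules were not applied or the robots.txt it saw differed from the excerpt.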

-------------------------------------------------------
This SF.net email is sponsored by: IT Product Guide on ITManagersJournal
Use IT products in your business? Tell us what you think of them. Give us
Your Opinions, Get Free ThinkGeek Gift Certificates! Click to find out more
http://productguide.itmanagersjournal.com/guidepromo.tmpl
_______________________________________________
Nutch-admin mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-admin



_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
