Dear Web Experts America,

Please see the message below, regarding your complaint about a Nutch crawler running on host '[EMAIL PROTECTED]'.
If you can provide us with more detailed information about the incident, we can investigate further.

Erik Lundberg
Director, CS Laboratory
Department of Computer Science & Engineering
University of Washington

---------- Original Message ----------
Date: Fri, 23 Sep 2005 12:25:49 -0700
From: WebExpertsAmerica <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED], [EMAIL PROTECTED]
Cc: nutch-agent@lucene.apache.org
Subject: Your Nutch Crawler is Out of Control - Apache Notified

Your crawler is ignoring our robots.txt file.

http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)" 128.95.1.189

You are eating bandwidth at our domain in incredible amounts. This is rude. Please stop or we will be forced to block your IP and the crawler you are using.

Best Regards,
Web Experts America

---------- Forwarded Message ----------

This may refer to a crawling task I ran intermittently over the last three weeks. We're definitely observing robots.txt, with code that's been widely tested. (Nutch is an Apache project that's been around for three years.) It's possible there's a bug in the robots code, but I'd find that somewhat surprising.

The only other thing I can think of is that WebExpertsAmerica is a search engine optimization company, and they might be doing something slightly tricky or unusual that confuses Nutch's politeness guarantees.

It's hard for me to say much else (e.g., how many of their pages we actually crawled, whether this is a widely seen problem) without a little more info (e.g., what domains they're complaining about, what kinds of other complaints we might have received). I'm happy to talk to you or anyone at CAC about any further action needed. Note that the task has been complete for some time, and I have no further crawling plans anytime soon.

--Mike
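[For list readers unfamiliar with how robots.txt observance is checked: the sketch below is not Nutch's actual code, just an illustration using Python's standard urllib.robotparser. The rules text and URLs are hypothetical examples; a compliant crawler consults rules like these before fetching each URL.]

```python
# Minimal sketch (NOT Nutch's implementation) of robots.txt checking,
# using Python's standard urllib.robotparser. The rules and URLs below
# are hypothetical examples for illustration only.
from urllib.robotparser import RobotFileParser

# Example robots.txt content, as a site might serve it:
rules = """\
User-agent: Nutch
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A crawler identifying itself as "Nutch" must skip disallowed paths:
print(parser.can_fetch("Nutch", "http://example.com/private/page.html"))  # False
print(parser.can_fetch("Nutch", "http://example.com/docs/page.html"))     # True
```

A bug in parsing unusual rule files (or a site serving different robots.txt content to different agents) is the kind of thing that could produce a mismatch between what each side believes is allowed.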