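It's probably more an issue with DNS resolution than with robots.txt. Even if you respect the robots.txt instructions, you can still have N host names, or even domain names, pointing to a single server, so per-host politeness limits do not protect that server. This can be avoided in Nutch by setting 'partition.url.mode' and 'fetcher.queue.mode' to 'byIP', so that URLs are partitioned and queued by IP address rather than by host name.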
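To illustrate the idea (this is not Nutch code, just a minimal Python sketch), grouping by IP means the fetch queue key is the resolved address instead of the host name. The names `resolve_ip` and `group_urls_by_ip` and the example hosts and addresses are made up for the illustration:

```python
import socket
from collections import defaultdict
from urllib.parse import urlsplit


def resolve_ip(host):
    """Resolve a host name to one IPv4 address via the system resolver."""
    return socket.gethostbyname(host)


def group_urls_by_ip(urls, resolver=resolve_ip):
    """Bucket URLs by the IP address their host resolves to.

    Each bucket stands in for one fetch queue, so a per-queue delay
    throttles the server rather than each individual host name.
    """
    queues = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip URLs that have no host part
        queues[resolver(host)].append(url)
    return dict(queues)


# Hypothetical lookup table standing in for DNS, so the example runs offline.
fake_dns = {
    "a.example.com": "192.0.2.1",
    "b.example.com": "192.0.2.1",  # second host name on the same server
    "c.example.org": "192.0.2.2",
}

queues = group_urls_by_ip(
    [
        "http://a.example.com/page1",
        "http://b.example.com/page2",
        "http://c.example.org/page3",
    ],
    resolver=fake_dns.__getitem__,
)
```

Here the two host names that share 192.0.2.1 end up in the same queue, whereas grouping by host would have treated them as two independent targets.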
On 16 August 2010 08:06, CatOs Mandros <[email protected]> wrote:

> Rather amusing :)
>
> Something similar was what made Grub gain a bit of bad reputation...
> thank god we have the robots.txt file.
>
> On Sat, Aug 14, 2010 at 7:48 PM, Mattmann, Chris A (388J)
> <[email protected]> wrote:
>
> > LOL...
> >
> > On 8/14/10 8:57 AM, "Ken Krugler" <[email protected]> wrote:
> >
> > Dear @80legs stop crushing metafilter.com from 2226 distinct IP addresses.
> > Your bots are DDOSing the site with thousands of requests. Stop.
> > <http://twitter.com/mathowie/status/20326707535>
> >
> > -- Ken
> >
> > --------------------------------------------
> > Ken Krugler
> > +1 530-210-6378
> > http://bixolabs.com
> > e l a s t i c w e b m i n i n g
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Senior Computer Scientist
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 171-266B, Mailstop: 171-246
> > Email: [email protected]
> > WWW: http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Assistant Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com

