Chris! This is a firm -1 from me! :) From the point of view of research and crawling certain pieces of the web, I strongly agree with you that it should be configurable. But because Nutch is an Apache project, I dismiss it (arguments available upon request). We should adhere to some ethics; it is bad enough that we can already DoS a server by setting some options to a high level. We publish source code, which leaves the option open to everyone to change it, and I think the current situation is balanced enough. Patching it is simple, and I think we should keep it like that :)
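To show just how small that local patch would be, here is a sketch of the kind of gate involved. The property name "robots.rules.ignore" is hypothetical (not an existing Nutch setting); it assumes the Hadoop Configuration and the crawler-commons BaseRobotRules class that the fetcher consults. An illustration, not actual Nutch code:

    import org.apache.hadoop.conf.Configuration;
    import crawlercommons.robots.BaseRobotRules;

    /**
     * Sketch of the proposed opt-out: a boolean property, false by
     * default, consulted before the usual robots.txt allow/deny check.
     * "robots.rules.ignore" is a hypothetical property name.
     */
    public class RobotRulesGate {

      public static final String IGNORE_KEY = "robots.rules.ignore";

      private final boolean ignoreRobots;

      public RobotRulesGate(Configuration conf) {
        // False by default: robots.txt stays honored unless the
        // operator explicitly opts out in nutch-site.xml.
        this.ignoreRobots = conf.getBoolean(IGNORE_KEY, false);
      }

      /** Returns true if the fetcher may request the given URL. */
      public boolean isAllowed(BaseRobotRules rules, String url) {
        return ignoreRobots || rules.isAllowed(url);
      }
    }

That is the whole of it, which is why I'd rather see the people who truly need this carry it as a local patch than ship it in an Apache release.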
Cheers,
Markus

-----Original message-----
> From: Mattmann, Chris A (3980) <[email protected]>
> Sent: Tuesday 27th January 2015 23:46
> To: [email protected]
> Subject: Option to disable Robots Rule checking
>
> Hey Guys,
>
> I’ve recently been made aware of some situations in which
> we are using crawlers like Nutch and we explicitly are looking
> not to honor robots.txt (some for research purposes; some for
> other purposes). Right now, of course, this isn’t possible since
> it’s always explicitly required.
>
> What would you guys think of an optional configuration (turned
> off by default) that allows bypassing of Robot rules?
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

