Hey Guys,

I've recently been made aware of some situations in which we are using crawlers like Nutch and explicitly do not want to honor robots.txt (some for research purposes, some for other purposes). Right now this isn't possible, since honoring robots rules is always enforced.

What would you guys think of an optional configuration property (turned off by default) that allows bypassing of robots rules?
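As a rough sketch of what I have in mind (the property name and wiring below are just placeholders, not an existing Nutch setting), the fetcher could consult a boolean flag before applying robots rules:

    import org.apache.hadoop.conf.Configuration;

    public class RobotsGate {
      /**
       * Returns true if the crawl should honor robots.txt.
       * "http.robots.obey" is a hypothetical property name; the default
       * stays true so existing crawls keep obeying robots.txt unless the
       * operator explicitly opts out in nutch-site.xml.
       */
      public static boolean shouldHonorRobots(Configuration conf) {
        return conf.getBoolean("http.robots.obey", true);
      }
    }

The fetcher would then call shouldHonorRobots(conf) before its usual robots check, so with the default left at true nothing changes for existing deployments.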
Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory
Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

