Chris! This is a firm -1 from me! :)

From the point of view of research and of crawling certain pieces of the web, I strongly agree with you that it should be configurable. But because Nutch is an Apache project, I dismiss it (arguments available upon request). We should adhere to some ethics; it is bad enough that we can DoS a server just by setting some options too high. We publish source code, which leaves the option open to everyone to change it, and I think the current situation is balanced enough. Patching it is simple, and I think we should keep it like that :)
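
For anyone who genuinely needs this for a local research crawl, the patch would amount to a guard around the robots verdict in the fetcher. Below is a minimal sketch only, assuming Nutch's Hadoop Configuration object; the property name "fetcher.robots.ignore" and the class around it are hypothetical illustrations, not actual Nutch source:

    import org.apache.hadoop.conf.Configuration;

    public class RobotsBypassSketch {

      // Hypothetical property name -- NOT an existing Nutch option.
      // Off by default, as Chris proposed.
      private static final String IGNORE_ROBOTS_KEY = "fetcher.robots.ignore";

      private final Configuration conf;

      public RobotsBypassSketch(Configuration conf) {
        this.conf = conf;
      }

      // Decide whether a URL may be fetched. allowedByRobots is the
      // verdict the protocol plugin derived from the site's robots.txt.
      public boolean isFetchAllowed(String url, boolean allowedByRobots) {
        if (conf.getBoolean(IGNORE_ROBOTS_KEY, false)) {
          return true; // explicit local opt-in: skip the robots.txt check
        }
        return allowedByRobots;
      }
    }

A locally patched build would then enable it via that property in nutch-site.xml; the point stands that nothing like this should ever ship enabled by default.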

Cheers,
Markus
 
 
-----Original message-----
> From: Mattmann, Chris A (3980) <[email protected]>
> Sent: Tuesday 27th January 2015 23:46
> To: [email protected]
> Subject: Option to disable Robots Rule checking
> 
> Hey Guys,
> 
> I’ve recently been made aware of some situations in which
> we are using crawlers like Nutch and are explicitly looking
> not to honor robots.txt (some for research purposes; some for
> other purposes). Right now, of course, this isn’t possible,
> since honoring robots.txt is always required.
> 
> What would you guys think of an optional configuration (turned
> off by default) that allows bypassing of robots rules?
> 
> Cheers,
> Chris
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
