Hi Markus,

Thanks for chiming in. Reading your reply below, I see that you agree it should be configurable, but that you dismiss the configuration option because Nutch is an Apache project. What about being an Apache project makes it any less ethical to have a configurable option, turned off by default, that allows the robot rules check to be disabled?
For full disclosure, I am looking into re-creating DDoS and other attacks as part of some security research, so I have valid use cases here for wanting to do this. You state that it's easy to patch Nutch (you are correct on that point; it's a 2-line patch to Fetcher.java to disable the RobotRules check). However, how is it any less easy to instead have a 1-line override that someone would have to apply to change the *default* behavior I'm suggesting, with RobotRules checking on by default in nutch-default.xml? So what I'm proposing, concretely, is:

1. Add a property like nutch.robots.rules.parser with its default value set to true, which enables the robot rules parser. Put this property, say, even at the bottom of nutch-default.xml, and state in its description that improper use of it in regular whole-web crawls can really hurt your crawling of a site.

2. Add a check in Fetcher.java for this property: if it's on, keep the default behavior; if it's off, skip the robots check. (A rough sketch of both pieces follows below.)
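To make that concrete, here is a rough, untested sketch of what I have in mind. The property name is the one proposed above; the helper class and method are just illustrations of the idea, not an actual patch against Fetcher.java (the real change would live inside the fetcher's existing robots handling).

First, the nutch-default.xml entry for step 1:

    <property>
      <name>nutch.robots.rules.parser</name>
      <value>true</value>
      <description>
        If true (the default), the fetcher honors robots.txt rules.
        Setting this to false disables the robot rules check entirely.
        Improper use of this in regular whole web crawls can really
        hurt your crawling of a site; it is intended only for
        controlled research use.
      </description>
    </property>

And the gist of the step 2 check, written here as a small standalone helper so the logic is easy to read. It assumes the Hadoop Configuration object and the crawler-commons BaseRobotRules instance that the fetcher already has in scope:

    import org.apache.hadoop.conf.Configuration;
    import crawlercommons.robots.BaseRobotRules;

    public class RobotRulesGate {

      /** Property proposed above; true (honor robots.txt) by default. */
      public static final String ROBOTS_PARSER_KEY =
          "nutch.robots.rules.parser";

      /**
       * Returns true if the given URL may be fetched. With the property
       * left at its default this is exactly the existing robots.txt
       * check; only an explicit override (e.g. in nutch-site.xml)
       * skips it.
       */
      public static boolean allowedToFetch(Configuration conf,
                                           BaseRobotRules rules,
                                           String url) {
        boolean honorRobots = conf.getBoolean(ROBOTS_PARSER_KEY, true);
        if (!honorRobots) {
          return true;               // override in effect: skip the check
        }
        return rules.isAllowed(url); // default behavior, unchanged
      }
    }

Anyone who leaves their configuration alone gets exactly today's behavior; only an explicit override in nutch-site.xml changes anything.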
The benefit is that you don't encourage people like me (and plenty of others I've talked to) who would like to use Nutch for security-research crawling to simply go fork it over a 1-line code change. Really? Is that what you want to encourage? The really negative part is that it will encourage me to simply use that forked version. I could maintain a patch file and apply it, but it will fall out of date with updates to Nutch, and I'll have to update that patch file whenever nutch-default.xml changes (and so will other people, etc.).

As you already stated, we have properties in Nutch that can turn it into a DDoS crawler with or without robots.txt rule parsing, and we set those properties to *sensible defaults*. I'm proposing a compromise that helps people like me, encourages me to keep using Nutch by keeping things simple, and is no worse than the few other properties we already expose in Nutch configuration that allow it to be turned into a DDoS bot (which, by the way, have bad uses, but also good [security research] uses, to prevent the bad guys).

I appreciate it if you've made it this far and hope you will reconsider.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Markus Jelsma <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, January 27, 2015 at 3:58 PM
To: "[email protected]" <[email protected]>
Subject: RE: Option to disable Robots Rule checking

>Chris! This is a firm -1 from me! :)
>
>From the point of view of research and crawling certain pieces of the
>web, and i strongly agree with you that it should be configurable. But
>because Nutch being an Apache project, i dismiss it (arguments available
>upon request). We should adhere to some ethics, it is bad enough that we
>can just DoS a server by setting some options to a high level. We publish
>source code, it leaves the option open to everyone to change it, and i
>think the current situation is balanced enough.
>Patching it is simple, i think we should keep it like that :)
>
>Cheers,
>Markus
>
>
>-----Original message-----
>> From: Mattmann, Chris A (3980) <[email protected]>
>> Sent: Tuesday 27th January 2015 23:46
>> To: [email protected]
>> Subject: Option to disable Robots Rule checking
>>
>> Hey Guys,
>>
>> I've recently been made aware of some situations in which
>> we are using crawlers like Nutch and we explicitly are looking
>> not to honor robots.txt (some for research purposes; some for
>> other purposes). Right now, of course, this isn't possible since
>> it's always explicitly required.
>>
>> What would you guys think of as an optional configuration (turned
>> off by default) that allows bypassing of Robot rules?
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: [email protected]
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>

