Hi Markus, hi Chris, hi Lewis,

-1 from me
A well-documented property is just an invitation to disable the robots
rules. A hidden property is no alternative either, because it would soon
be "documented" on our mailing lists or somewhere else on the web.

And shall we really remove or reformulate "Our software obeys the
robots.txt exclusion standard" on http://nutch.apache.org/bot.html ?

Since the agent string sent in the HTTP request always contains
"/Nutch-x.x" (changing that would also require a patch), I wouldn't make
it too easy to make Nutch ignore robots.txt.

> As you already stated too, we have properties in Nutch that can
> turn Nutch into a DDOS crawler with or without robots.txt rule
> parsing. We set these properties to *sensible defaults*.

If robots.txt is obeyed, web masters can even prevent this by adding a
"Crawl-delay" rule (e.g. "Crawl-delay: 5" to request a 5-second pause
between fetches) to their robots.txt.

> (from Chris):
> but there are also good [security research] uses of as well

> (from Lewis):
> I've met many web admins recently that want to search and index
> their entire DNS but do not wish to disable their robots.txt filter
> in order to do so.

Ok, these are valid use cases. What they have in common is that the
Nutch user owns the crawled servers or is (hopefully) explicitly allowed
to perform the security research.

What about an option (or config file) to explicitly exclude a list of
hosts (or IPs) from robots.txt parsing? That would require more effort
to configure than a boolean property, but because it's explicit it
prevents users from disabling robots.txt in general and also guarantees
that the security research is not accidentally "extended".
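Roughly something like the sketch below. (Just an illustration of the
idea, not working code: the property name "robots.whitelist.hosts", the
class and the place where it would be consulted are all made up.)

  import java.util.HashSet;
  import java.util.Set;

  import org.apache.hadoop.conf.Configuration;

  /**
   * Sketch: robots.txt stays mandatory unless a host is listed
   * explicitly in the (hypothetical) property robots.whitelist.hosts,
   * e.g. robots.whitelist.hosts=host1.example.org,192.0.2.10
   */
  public class RobotsWhitelist {

    public static final String PROP = "robots.whitelist.hosts";

    private final Set<String> whitelist = new HashSet<String>();

    public RobotsWhitelist(Configuration conf) {
      // comma-separated hosts/IPs the crawl operator explicitly owns
      for (String host : conf.getTrimmedStrings(PROP)) {
        whitelist.add(host.toLowerCase());
      }
    }

    /** True if robots.txt must be fetched and applied for this host. */
    public boolean mustObeyRobots(String host) {
      return !whitelist.contains(host.toLowerCase());
    }
  }

A check like this could sit right next to the existing RobotRules
handling in the fetcher: robots.txt would then be ignored only for hosts
the operator listed explicitly, never globally via a single boolean.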
Cheers,
Sebastian

On 01/28/2015 01:54 AM, Mattmann, Chris A (3980) wrote:
> Hi Markus,
>
> Thanks for chiming in. I’m reading the below and I see you agree that
> it should be configurable, but you state that because Nutch is an
> Apache project, you dismiss the configuration option. What about it
> being an Apache project makes it any less ethical to simply have a
> configurable option, turned off by default, that allows the Robot
> rules to be disabled?
>
> For full disclosure, I am looking into re-creating DDOS and other
> attacks doing some security research, and so I have valid use cases
> here for wanting to do so. You state it’s easy to patch Nutch (you
> are correct for that matter, it’s a 2-line patch to Fetcher.java to
> disable the RobotRules check). However, how is it any less easy to
> have a 1-line patch that someone would have to apply to *override*
> the *default* behavior I’m suggesting of RobotRules being on in
> nutch-default.xml? So what I’m stating literally in code is:
>
> 1. Adding a property like nutch.robots.rules.parser and setting its
> default value to true, which enables the robot rules parser, putting
> this property say even at the bottom of nutch-default.xml and
> stating that improper use of this property in regular situations of
> whole-web crawls can really hurt your crawling of a site.
>
> 2. Having a check in Fetcher.java that checks for this property: if
> it’s on, default behavior; if it’s off, skip the check.
>
> The benefit being you don’t encourage people like me (and lots of
> others that I’ve talked to) who would like to use Nutch for some
> security research for crawling to simply go fork it for a 1-line
> code change. Really? Is that what you want to encourage? The really
> negative part about that is that it will encourage me to simply use
> that forked version. I could maintain a patch file, and apply that,
> but it’s going to fall out of date with updates to Nutch; I’m going
> to have to update that patch file if nutch-default.xml changes (and
> so will other people, etc.)
>
> As you already stated too, we have properties in Nutch that can turn
> Nutch into a DDOS crawler with or without robots.txt rule parsing.
> We set these properties to *sensible defaults*. I’m proposing a
> compromise that helps people like me; encourages me to keep using
> Nutch through simplification; and is no worse than the few other
> properties that we already expose in Nutch configuration to allow it
> to be turned into a DDOS bot (which, by the way, there are bad uses
> of, but there are also good [security research] uses of as well, to
> prevent the bad guys).
>
> I appreciate it if you made it this far and hope you will reconsider.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
> -----Original Message-----
> From: Markus Jelsma <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Tuesday, January 27, 2015 at 3:58 PM
> To: "[email protected]" <[email protected]>
> Subject: RE: Option to disable Robots Rule checking
>
>> Chris! This is a firm -1 from me! :)
>>
>> From the point of view of research and crawling certain pieces of
>> the web I strongly agree with you that it should be configurable.
>> But because Nutch is an Apache project, I dismiss it (arguments
>> available upon request). We should adhere to some ethics; it is bad
>> enough that we can just DoS a server by setting some options to a
>> high level. We publish source code, which leaves the option open to
>> everyone to change it, and I think the current situation is
>> balanced enough. Patching it is simple, and I think we should keep
>> it like that :)
>>
>> Cheers,
>> Markus
>>
>>
>> -----Original message-----
>>> From: Mattmann, Chris A (3980) <[email protected]>
>>> Sent: Tuesday 27th January 2015 23:46
>>> To: [email protected]
>>> Subject: Option to disable Robots Rule checking
>>>
>>> Hey Guys,
>>>
>>> I’ve recently been made aware of some situations in which we are
>>> using crawlers like Nutch and we explicitly are looking not to
>>> honor robots.txt (some for research purposes; some for other
>>> purposes). Right now, of course, this isn’t possible since it’s
>>> always explicitly required.
>>>
>>> What would you guys think of an optional configuration (turned off
>>> by default) that allows bypassing of Robot rules?
>>>
>>> Cheers,
>>> Chris
>>>
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398)
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: [email protected]
>>> WWW: http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

