Seb, I like this idea. What do you think, Lewis and Markus? Thanks, that would help me and my use case.
Sent from my iPhone

> On Jan 28, 2015, at 3:17 PM, Sebastian Nagel <[email protected]> wrote:
>
> Hi Markus, hi Chris, hi Lewis,
>
> -1 from me
>
> A well-documented property is just an invitation to
> disable robots rules. A hidden property is also no
> alternative because it will soon be "documented"
> in our mailing lists or somewhere on the web.
>
> And shall we really remove or reformulate
> "Our software obeys the robots.txt exclusion standard"
> on http://nutch.apache.org/bot.html ?
>
> Since the agent string sent in the HTTP request always contains "/Nutch-x.x"
> (it would also require a patch to change it), I wouldn't make
> it too easy to make Nutch ignore robots.txt.
>
>> As you already stated too, we have properties in Nutch that can
>> turn Nutch into a DDOS crawler with or without robots.txt rule
>> parsing. We set these properties to *sensible defaults*.
>
> If the robots.txt is obeyed, web masters can even prevent this
> by adding a "Crawl-delay" rule to their robots.txt.
>
>> (from Chris):
>> but there are also good [security research] uses as well
>
>> (from Lewis):
>> I've met many web admins recently that want to search and index
>> their entire DNS but do not wish to disable their robots.txt filter
>> in order to do so.
>
> Ok, these are valid use cases. They have in common that
> the Nutch user owns the crawled servers or is (hopefully)
> explicitly allowed to perform the security research.
>
> What about an option (or config file) to explicitly exclude
> a list of hosts (or IPs) from robots.txt parsing?
> That would require more effort to configure than a boolean property,
> but because it is explicit, it prevents users from disabling
> robots.txt in general and also guarantees that
> the security research is not accidentally "extended".
>
> Cheers,
> Sebastian
>
>
>> On 01/28/2015 01:54 AM, Mattmann, Chris A (3980) wrote:
>> Hi Markus,
>>
>> Thanks for chiming in. I’m reading the below and I see you
>> agree that it should be configurable, but you state that
>> because Nutch is an Apache project, you dismiss the configuration
>> option. What about it being an Apache project makes it any less
>> ethical to simply have a configurable option, turned off
>> by default, that allows the robot rules to be disabled?
>>
>> For full disclosure, I am looking into re-creating DDOS and other
>> attacks doing some security research, and so I have valid use cases
>> here for wanting to do so. You state it’s easy to patch Nutch (you
>> are correct for that matter; it’s a 2-line patch to Fetcher.java
>> to disable the RobotRules check). However, how is it any less easy
>> to have a 1-line patch that someone would have to apply to *override*
>> the *default* behavior I’m suggesting of RobotRules being on in
>> nutch-default.xml? So what I’m stating literally in code is:
>>
>> 1. Adding a property like nutch.robots.rules.parser and setting its
>> default value to true, which enables the robot rules parser, putting
>> this property say even at the bottom of nutch-default.xml and
>> stating that improper use of this property in regular situations of
>> whole-web crawls can really hurt your crawling of a site.
>>
>> 2. Having a check in Fetcher.java that checks for this property: if it’s
>> on, default behavior; if it’s off, skip the check.
>>
>> The benefit being you don’t encourage people like me (and lots of
>> others that I’ve talked to) who would like to use Nutch for some
>> security research for crawling to simply go fork it for a 1-line code
>> change. Really? Is that what you want to encourage? The really negative
>> part about that is that it will encourage me to simply use that forked
>> version. I could maintain a patch file, and apply that, but it’s going
>> to fall out of date with updates to Nutch; I’m going to have to update that
>> patch file if nutch-default.xml changes (and so will other people, etc.).
>>
>> As you already stated too, we have properties in Nutch that can
>> turn Nutch into a DDOS crawler with or without robots.txt rule
>> parsing. We set these properties to *sensible defaults*. I’m proposing
>> a compromise that helps people like me; encourages me to keep using
>> Nutch through simplification; and is no worse than the few other
>> properties that we already expose in Nutch configuration to allow it
>> to be turned into a DDOS bot (which, by the way, there are bad uses of,
>> but there are also good [security research] uses of as well, to prevent
>> the bad guys).
>>
>> I appreciate it if you made it this far and hope you will reconsider.
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: [email protected]
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>> -----Original Message-----
>> From: Markus Jelsma <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, January 27, 2015 at 3:58 PM
>> To: "[email protected]" <[email protected]>
>> Subject: RE: Option to disable Robots Rule checking
>>
>>> Chris! This is a firm -1 from me! :)
>>>
>>> From the point of view of research and crawling certain pieces of the
>>> web, I strongly agree with you that it should be configurable. But
>>> because Nutch is an Apache project, I dismiss it (arguments available
>>> upon request). We should adhere to some ethics; it is bad enough that we
>>> can just DoS a server by setting some options to a high level. We publish
>>> source code, which leaves the option open to everyone to change it, and I
>>> think the current situation is balanced enough.
>>> Patching it is simple; I think we should keep it like that :)
>>>
>>> Cheers,
>>> Markus
>>>
>>>
>>> -----Original message-----
>>>> From: Mattmann, Chris A (3980) <[email protected]>
>>>> Sent: Tuesday 27th January 2015 23:46
>>>> To: [email protected]
>>>> Subject: Option to disable Robots Rule checking
>>>>
>>>> Hey Guys,
>>>>
>>>> I’ve recently been made aware of some situations in which
>>>> we are using crawlers like Nutch and we explicitly are looking
>>>> not to honor robots.txt (some for research purposes; some for
>>>> other purposes). Right now, of course, this isn’t possible since
>>>> it’s always explicitly required.
>>>>
>>>> What would you guys think of an optional configuration (turned
>>>> off by default) that allows bypassing of robot rules?
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Chief Architect
>>>> Instrument Software and Science Data Systems Section (398)
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 168-519, Mailstop: 168-527
>>>> Email: [email protected]
>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Associate Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
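
For concreteness, here is a minimal sketch (not actual Nutch source) of the two-step change Chris describes: a boolean property, defaulting to true, checked in Fetcher.java before the robots rules are applied. The property name nutch.robots.rules.parser is taken from his message; the RobotRulesChecker interface below is a hypothetical stand-in for the real robots-rules lookup, and the Hadoop-style Configuration lookup is an assumption about how the property would be read.

    import org.apache.hadoop.conf.Configuration;

    public class RobotsGuardSketch {

      /** Hypothetical stand-in for the object that answers "is this URL allowed?". */
      interface RobotRulesChecker {
        boolean isAllowed(String url);
      }

      /**
       * Returns true if the URL may be fetched. With the proposed property at its
       * default (true) this is exactly today's behavior: robots.txt is consulted.
       * Only an explicit override to false skips the check.
       */
      static boolean mayFetch(Configuration conf, RobotRulesChecker rules, String url) {
        boolean honorRobots = conf.getBoolean("nutch.robots.rules.parser", true);
        if (!honorRobots) {
          return true;               // robots checking explicitly disabled by the operator
        }
        return rules.isAllowed(url); // default path: obey robots.txt
      }
    }

The matching nutch-default.xml entry would simply declare the property with value true and a description warning that disabling it in ordinary whole-web crawls can get a crawler banned.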

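Sebastian's alternative keeps robots.txt parsing on unconditionally except for hosts the operator lists explicitly. A rough sketch of that idea, again assuming a Hadoop-style Configuration and a hypothetical property name (robots.override.hosts), neither of which exists in Nutch today:

    import java.net.URL;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.conf.Configuration;

    public class RobotsHostExclusionSketch {

      /** Hosts (or IPs) the operator explicitly owns or is allowed to test. */
      static Set<String> excludedHosts(Configuration conf) {
        String[] hosts = conf.getStrings("robots.override.hosts", new String[0]);
        return new HashSet<>(Arrays.asList(hosts));
      }

      /** robots.txt parsing is skipped only for hosts on the explicit list. */
      static boolean skipRobotsFor(String url, Set<String> excluded) {
        try {
          return excluded.contains(new URL(url).getHost().toLowerCase());
        } catch (Exception e) {
          return false; // malformed URL: fall back to obeying robots.txt
        }
      }
    }

Because the override is per-host, no general "ignore robots.txt" switch ever exists, which is what makes this option harder to misuse than a single boolean property.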
