Yay! OK, I will go ahead and start work on it. Thank you all!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


-----Original Message-----
From: Markus Jelsma <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, January 29, 2015 at 3:35 AM
To: "[email protected]" <[email protected]>
Subject: RE: Option to disable Robots Rule checking

>I am happy with this alternative! :)
>
>-----Original message-----
>> From: Mattmann, Chris A (3980) <[email protected]>
>> Sent: Thursday 29th January 2015 1:21
>> To: <[email protected]> <[email protected]>
>> Subject: Re: Option to disable Robots Rule checking
>>
>> Seb, I like this idea. What do you think, Lewis and Markus? Thanks, that
>> would help me and my use case.
>>
>> Sent from my iPhone
>>
>> > On Jan 28, 2015, at 3:17 PM, Sebastian Nagel
>> > <[email protected]> wrote:
>> >
>> > Hi Markus, hi Chris, hi Lewis,
>> >
>> > -1 from me
>> >
>> > A well-documented property is just an invitation to
>> > disable robots rules. A hidden property is no
>> > alternative either, because it will soon be "documented"
>> > in our mailing lists or somewhere on the web.
>> >
>> > And shall we really remove or reformulate
>> > "Our software obeys the robots.txt exclusion standard"
>> > on http://nutch.apache.org/bot.html ?
>> >
>> > Since the agent string sent in the HTTP request always contains
>> > "/Nutch-x.x" (it would also require a patch to change that), I
>> > wouldn't make it too easy to make Nutch ignore robots.txt.
>> >
>> >> As you already stated too, we have properties in Nutch that can
>> >> turn Nutch into a DDOS crawler with or without robots.txt rule
>> >> parsing. We set these properties to *sensible defaults*.
>> >
>> > If robots.txt is obeyed, web masters can even prevent this
>> > by adding a "Crawl-delay" rule to their robots.txt.
>> >
>> >> (from Chris):
>> >> but there are also good [security research] uses as well
>> >
>> >> (from Lewis):
>> >> I've met many web admins recently that want to search and index
>> >> their entire DNS but do not wish to disable their robots.txt filter
>> >> in order to do so.
>> >
>> > OK, these are valid use cases. They have in common that
>> > the Nutch user owns the crawled servers or is (hopefully)
>> > explicitly allowed to perform the security research.
>> >
>> > What about an option (or config file) to explicitly exclude
>> > a list of hosts (or IPs) from robots.txt parsing?
>> > That would require more effort to configure than a boolean property,
>> > but because it is explicit, it prevents users from disabling
>> > robots.txt in general and also guarantees that
>> > the security research is not accidentally "extended".
>> >
>> > Cheers,
>> > Sebastian
>> >
>> >> On 01/28/2015 01:54 AM, Mattmann, Chris A (3980) wrote:
>> >> Hi Markus,
>> >>
>> >> Thanks for chiming in. I’m reading the below and I see you
>> >> agree that it should be configurable, but you state that
>> >> because Nutch is an Apache project, you dismiss the configuration
>> >> option.
>> >> What about it being an Apache project makes it any less
>> >> ethical to simply have a configurable option, turned off
>> >> by default, that allows the Robot rules to be disabled?
>> >>
>> >> For full disclosure, I am looking into re-creating DDOS and other
>> >> attacks as part of some security research, and so I have valid use
>> >> cases here for wanting to do so. You state it’s easy to patch Nutch
>> >> (you are correct, for that matter: it’s a 2-line patch to Fetcher.java
>> >> to disable the RobotRules check). However, how is it any less easy
>> >> to have a 1-line patch that someone would have to apply to *override*
>> >> the *default* behavior I’m suggesting of RobotRules being on in
>> >> nutch-default.xml? So what I’m stating literally in code is:
>> >>
>> >> 1. Adding a property like nutch.robots.rules.parser and setting its
>> >> default value to true, which enables the robot rules parser; putting
>> >> this property, say, even at the bottom of nutch-default.xml; and
>> >> stating that improper use of this property in regular situations of
>> >> whole web crawls can really hurt your crawling of a site.
>> >>
>> >> 2. Having a check in Fetcher.java that checks for this property: if
>> >> it’s on, default behavior; if it’s off, skip the check.
>> >>
>> >> The benefit being you don’t encourage people like me (and lots of
>> >> others that I’ve talked to) who would like to use Nutch for some
>> >> security research for crawling to simply go fork it for a 1-line code
>> >> change. Really? Is that what you want to encourage? The really negative
>> >> part about that is that it will encourage me to simply use that forked
>> >> version. I could maintain a patch file, and apply that, but it’s going
>> >> to fall out of date with updates to Nutch; I’m going to have to update
>> >> that patch file if nutch-default.xml changes (and so will other people,
>> >> etc.).
>> >>
>> >> As you already stated too, we have properties in Nutch that can
>> >> turn Nutch into a DDOS crawler with or without robots.txt rule
>> >> parsing. We set these properties to *sensible defaults*. I’m proposing
>> >> a compromise that helps people like me; encourages me to keep using
>> >> Nutch through simplification; and is no worse than the few other
>> >> properties that we already expose in Nutch configuration to allow it
>> >> to be turned into a DDOS bot (which, by the way, there are bad uses of,
>> >> but also good [security research] uses, to prevent the bad guys).
>> >>
>> >> I appreciate it if you made it this far and hope you will reconsider.
>> >>
>> >> Cheers,
>> >> Chris
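
For concreteness, here is a minimal, hypothetical sketch of the two-step change
described in the message above. The property name nutch.robots.rules.parser is
taken from that message; the class, method, and RobotRules stand-in are invented
for illustration and are not Nutch's actual Fetcher code. In Nutch the boolean
would be declared (with default "true") in nutch-default.xml and read from the
job Configuration rather than from a Properties object.

    // Hypothetical sketch only -- not actual Nutch Fetcher code.
    import java.util.Properties;

    public class RobotsRuleSwitchSketch {

      // Proposed property: true (default) means robots.txt rules are honored.
      static final String ROBOTS_PROP = "nutch.robots.rules.parser";

      /** Stand-in for the parsed robots.txt rules the fetcher consults. */
      interface RobotRules {
        boolean isAllowed(String url);
      }

      /** Returns true if the URL may be fetched under the configured policy. */
      static boolean allowedToFetch(Properties conf, String url, RobotRules rules) {
        boolean honorRobots =
            Boolean.parseBoolean(conf.getProperty(ROBOTS_PROP, "true"));
        if (!honorRobots) {
          return true;                // property switched off: skip the robots.txt check
        }
        return rules.isAllowed(url);  // default behavior: obey robots.txt
      }

      public static void main(String[] args) {
        Properties conf = new Properties();     // property unset -> default "true"
        RobotRules denyAll = url -> false;      // a robots.txt that disallows everything

        System.out.println(allowedToFetch(conf, "http://example.com/", denyAll)); // false

        conf.setProperty(ROBOTS_PROP, "false"); // operator explicitly opts out
        System.out.println(allowedToFetch(conf, "http://example.com/", denyAll)); // true
      }
    }
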
>> >>
>> >> -----Original Message-----
>> >> From: Markus Jelsma <[email protected]>
>> >> Reply-To: "[email protected]" <[email protected]>
>> >> Date: Tuesday, January 27, 2015 at 3:58 PM
>> >> To: "[email protected]" <[email protected]>
>> >> Subject: RE: Option to disable Robots Rule checking
>> >>
>> >>> Chris! This is a firm -1 from me! :)
>> >>>
>> >>> From the point of view of research and crawling certain pieces of the
>> >>> web, I strongly agree with you that it should be configurable. But
>> >>> because Nutch is an Apache project, I dismiss it (arguments available
>> >>> upon request). We should adhere to some ethics; it is bad enough that
>> >>> we can just DoS a server by setting some options to a high level. We
>> >>> publish source code, which leaves the option open to everyone to change
>> >>> it, and I think the current situation is balanced enough.
>> >>> Patching it is simple; I think we should keep it like that :)
>> >>>
>> >>> Cheers,
>> >>> Markus
>> >>>
>> >>> -----Original message-----
>> >>>> From: Mattmann, Chris A (3980) <[email protected]>
>> >>>> Sent: Tuesday 27th January 2015 23:46
>> >>>> To: [email protected]
>> >>>> Subject: Option to disable Robots Rule checking
>> >>>>
>> >>>> Hey Guys,
>> >>>>
>> >>>> I’ve recently been made aware of some situations in which
>> >>>> we are using crawlers like Nutch and we explicitly are looking
>> >>>> not to honor robots.txt (some for research purposes; some for
>> >>>> other purposes). Right now, of course, this isn’t possible since
>> >>>> it’s always explicitly required.
>> >>>>
>> >>>> What would you guys think of an optional configuration (turned
>> >>>> off by default) that allows bypassing of Robot rules?
>> >>>>
>> >>>> Cheers,
>> >>>> Chris
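
The "alternative" accepted at the top of this thread is Sebastian's explicit
exemption list rather than a global boolean. Below is a minimal sketch of what
such a per-host check might look like, assuming the operator supplies a
comma-separated list of hosts or IPs they own or are permitted to probe; the
class, method, and property handling are hypothetical and not part of Nutch.

    // Hypothetical sketch only -- not actual Nutch code. Robots.txt is always
    // honored unless the URL's host appears in an explicit, operator-maintained
    // exemption list (e.g. loaded from a config file or property).
    import java.net.URI;
    import java.util.HashSet;
    import java.util.Set;

    public class RobotsExemptionListSketch {

      private final Set<String> exemptHosts = new HashSet<>();

      /** @param hostList comma-separated hosts/IPs the operator owns or may probe */
      RobotsExemptionListSketch(String hostList) {
        if (hostList != null) {
          for (String h : hostList.split(",")) {
            String host = h.trim().toLowerCase();
            if (!host.isEmpty()) {
              exemptHosts.add(host);
            }
          }
        }
      }

      /** True if robots.txt parsing should be skipped for this URL's host. */
      boolean isExempt(String url) {
        try {
          String host = URI.create(url).getHost();
          return host != null && exemptHosts.contains(host.toLowerCase());
        } catch (IllegalArgumentException e) {
          return false;  // malformed URL: fall back to normal robots.txt handling
        }
      }

      public static void main(String[] args) {
        // Only hosts the operator names explicitly are ever exempted.
        RobotsExemptionListSketch exemptions =
            new RobotsExemptionListSketch("intranet.example.org, 10.0.0.5");
        System.out.println(exemptions.isExempt("http://intranet.example.org/page")); // true
        System.out.println(exemptions.isExempt("http://www.example.com/"));           // false
      }
    }

Because every exempted host has to be named by hand, the default (obey
robots.txt everywhere) stays untouched and the exemption cannot silently widen
beyond the servers the operator listed, which is the guarantee Sebastian argues
for above.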

