Yay! OK, I will go ahead and start work on it. Thank you all!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


-----Original Message-----
From: Markus Jelsma <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, January 29, 2015 at 3:35 AM
To: "[email protected]" <[email protected]>
Subject: RE: Option to disable Robots Rule checking

>I am happy with this alternative! :)
>
>-----Original message-----
>> From: Mattmann, Chris A (3980) <[email protected]>
>> Sent: Thursday 29th January 2015 1:21
>> To: <[email protected]> <[email protected]>
>> Subject: Re: Option to disable Robots Rule checking
>>
>> Seb, I like this idea. What do you think, Lewis and Markus? Thanks, that
>> would help me and my use case.
>>
>> Sent from my iPhone
>>
>> > On Jan 28, 2015, at 3:17 PM, Sebastian Nagel
>> > <[email protected]> wrote:
>> >
>> > Hi Markus, hi Chris, hi Lewis,
>> >
>> > -1 from me
>> >
>> > A well-documented property is just an invitation to
>> > disable robots rules. A hidden property is no
>> > alternative either, because it will soon be "documented"
>> > in our mailing lists or somewhere on the web.
>> >
>> > And shall we really remove or reformulate
>> > "Our software obeys the robots.txt exclusion standard"
>> > on http://nutch.apache.org/bot.html ?
>> >
>> > Since the agent string sent in the HTTP request always contains
>> > "/Nutch-x.x" (it would also require a patch to change that), I
>> > wouldn't make it too easy to make Nutch ignore robots.txt.
>> >
>> >> As you already stated too, we have properties in Nutch that can
>> >> turn Nutch into a DDOS crawler with or without robots.txt rule
>> >> parsing. We set these properties to *sensible defaults*.
>> >
>> > If robots.txt is obeyed, web masters can even prevent this
>> > by adding a "Crawl-delay" rule to their robots.txt.
>> >
>> >> (from Chris):
>> >> but there are also good [security research] uses as well
>> >
>> >> (from Lewis):
>> >> I've met many web admins recently that want to search and index
>> >> their entire DNS but do not wish to disable their robots.txt filter
>> >> in order to do so.
>> >
>> > OK, these are valid use cases. They have in common that
>> > the Nutch user owns the crawled servers or is (hopefully)
>> > explicitly allowed to perform the security research.
>> >
>> > What about an option (or config file) to explicitly exclude
>> > a list of hosts (or IPs) from robots.txt parsing?
>> > That would require more effort to configure than a boolean property,
>> > but because it is explicit, it prevents users from disabling
>> > robots.txt in general and also guarantees that
>> > the security research is not accidentally "extended".
>> >
>> > Cheers,
>> > Sebastian
>> >
>> >> On 01/28/2015 01:54 AM, Mattmann, Chris A (3980) wrote:
>> >> Hi Markus,
>> >>
>> >> Thanks for chiming in. I’m reading the below and I see you
>> >> agree that it should be configurable, but you state that
>> >> because Nutch is an Apache project, you dismiss the configuration
>> >> option.
>> >> What about it being an Apache project makes it any less
>> >> ethical to simply have a configurable option, turned off
>> >> by default, that allows the Robot rules to be disabled?
>> >>
>> >> For full disclosure, I am looking into re-creating DDOS and other
>> >> attacks as part of some security research, and so I have valid use
>> >> cases here for wanting to do so. You state it’s easy to patch Nutch
>> >> (you are correct, for that matter: it’s a 2-line patch to Fetcher.java
>> >> to disable the RobotRules check). However, how is it any less easy
>> >> to have a 1-line patch that someone would have to apply to *override*
>> >> the *default* behavior I’m suggesting of RobotRules being on in
>> >> nutch-default.xml? So what I’m stating literally in code is:
>> >>
>> >> 1. Adding a property like nutch.robots.rules.parser and setting its
>> >> default value to true, which enables the robot rules parser; putting
>> >> this property, say, even at the bottom of nutch-default.xml; and
>> >> stating that improper use of this property in regular situations of
>> >> whole web crawls can really hurt your crawling of a site.
>> >>
>> >> 2. Having a check in Fetcher.java that checks for this property: if
>> >> it’s on, default behavior; if it’s off, skip the check.
>> >>
>> >> The benefit being you don’t encourage people like me (and lots of
>> >> others that I’ve talked to) who would like to use Nutch for some
>> >> security research for crawling to simply go fork it for a 1-line code
>> >> change. Really? Is that what you want to encourage? The really negative
>> >> part about that is that it will encourage me to simply use that forked
>> >> version. I could maintain a patch file, and apply that, but it’s going
>> >> to fall out of date with updates to Nutch; I’m going to have to update
>> >> that patch file if nutch-default.xml changes (and so will other people,
>> >> etc.).
>> >>
>> >> As you already stated too, we have properties in Nutch that can
>> >> turn Nutch into a DDOS crawler with or without robots.txt rule
>> >> parsing. We set these properties to *sensible defaults*. I’m proposing
>> >> a compromise that helps people like me; encourages me to keep using
>> >> Nutch through simplification; and is no worse than the few other
>> >> properties that we already expose in Nutch configuration to allow it
>> >> to be turned into a DDOS bot (which, by the way, there are bad uses of,
>> >> but also good [security research] uses, to prevent the bad guys).
>> >>
>> >> I appreciate it if you made it this far and hope you will reconsider.
>> >>
>> >> Cheers,
>> >> Chris
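
For concreteness, here is a minimal, hypothetical sketch of the two-step change
described in the message above. The property name nutch.robots.rules.parser is
taken from that message; the class, method, and RobotRules stand-in are invented
for illustration and are not Nutch's actual Fetcher code. In Nutch the boolean
would be declared (with default "true") in nutch-default.xml and read from the
job Configuration rather than from a Properties object.

    // Hypothetical sketch only -- not actual Nutch Fetcher code.
    import java.util.Properties;

    public class RobotsRuleSwitchSketch {

      // Proposed property: true (default) means robots.txt rules are honored.
      static final String ROBOTS_PROP = "nutch.robots.rules.parser";

      /** Stand-in for the parsed robots.txt rules the fetcher consults. */
      interface RobotRules {
        boolean isAllowed(String url);
      }

      /** Returns true if the URL may be fetched under the configured policy. */
      static boolean allowedToFetch(Properties conf, String url, RobotRules rules) {
        boolean honorRobots =
            Boolean.parseBoolean(conf.getProperty(ROBOTS_PROP, "true"));
        if (!honorRobots) {
          return true;                // property switched off: skip the robots.txt check
        }
        return rules.isAllowed(url);  // default behavior: obey robots.txt
      }

      public static void main(String[] args) {
        Properties conf = new Properties();     // property unset -> default "true"
        RobotRules denyAll = url -> false;      // a robots.txt that disallows everything

        System.out.println(allowedToFetch(conf, "http://example.com/", denyAll)); // false

        conf.setProperty(ROBOTS_PROP, "false"); // operator explicitly opts out
        System.out.println(allowedToFetch(conf, "http://example.com/", denyAll)); // true
      }
    }
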
>> >>
>> >> -----Original Message-----
>> >> From: Markus Jelsma <[email protected]>
>> >> Reply-To: "[email protected]" <[email protected]>
>> >> Date: Tuesday, January 27, 2015 at 3:58 PM
>> >> To: "[email protected]" <[email protected]>
>> >> Subject: RE: Option to disable Robots Rule checking
>> >>
>> >>> Chris! This is a firm -1 from me! :)
>> >>>
>> >>> From the point of view of research and crawling certain pieces of the
>> >>> web, I strongly agree with you that it should be configurable. But
>> >>> because Nutch is an Apache project, I dismiss it (arguments available
>> >>> upon request). We should adhere to some ethics; it is bad enough that
>> >>> we can just DoS a server by setting some options to a high level. We
>> >>> publish source code, which leaves the option open to everyone to change
>> >>> it, and I think the current situation is balanced enough.
>> >>> Patching it is simple; I think we should keep it like that :)
>> >>>
>> >>> Cheers,
>> >>> Markus
>> >>>
>> >>> -----Original message-----
>> >>>> From: Mattmann, Chris A (3980) <[email protected]>
>> >>>> Sent: Tuesday 27th January 2015 23:46
>> >>>> To: [email protected]
>> >>>> Subject: Option to disable Robots Rule checking
>> >>>>
>> >>>> Hey Guys,
>> >>>>
>> >>>> I’ve recently been made aware of some situations in which
>> >>>> we are using crawlers like Nutch and we explicitly are looking
>> >>>> not to honor robots.txt (some for research purposes; some for
>> >>>> other purposes). Right now, of course, this isn’t possible since
>> >>>> it’s always explicitly required.
>> >>>>
>> >>>> What would you guys think of an optional configuration (turned
>> >>>> off by default) that allows bypassing of Robot rules?
>> >>>>
>> >>>> Cheers,
>> >>>> Chris
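
The "alternative" accepted at the top of this thread is Sebastian's explicit
exemption list rather than a global boolean. Below is a minimal sketch of what
such a per-host check might look like, assuming the operator supplies a
comma-separated list of hosts or IPs they own or are permitted to probe; the
class, method, and property handling are hypothetical and not part of Nutch.

    // Hypothetical sketch only -- not actual Nutch code. Robots.txt is always
    // honored unless the URL's host appears in an explicit, operator-maintained
    // exemption list (e.g. loaded from a config file or property).
    import java.net.URI;
    import java.util.HashSet;
    import java.util.Set;

    public class RobotsExemptionListSketch {

      private final Set<String> exemptHosts = new HashSet<>();

      /** @param hostList comma-separated hosts/IPs the operator owns or may probe */
      RobotsExemptionListSketch(String hostList) {
        if (hostList != null) {
          for (String h : hostList.split(",")) {
            String host = h.trim().toLowerCase();
            if (!host.isEmpty()) {
              exemptHosts.add(host);
            }
          }
        }
      }

      /** True if robots.txt parsing should be skipped for this URL's host. */
      boolean isExempt(String url) {
        try {
          String host = URI.create(url).getHost();
          return host != null && exemptHosts.contains(host.toLowerCase());
        } catch (IllegalArgumentException e) {
          return false;  // malformed URL: fall back to normal robots.txt handling
        }
      }

      public static void main(String[] args) {
        // Only hosts the operator names explicitly are ever exempted.
        RobotsExemptionListSketch exemptions =
            new RobotsExemptionListSketch("intranet.example.org, 10.0.0.5");
        System.out.println(exemptions.isExempt("http://intranet.example.org/page")); // true
        System.out.println(exemptions.isExempt("http://www.example.com/"));           // false
      }
    }

Because every exempted host has to be named by hand, the default (obey
robots.txt everywhere) stays untouched and the exemption cannot silently widen
beyond the servers the operator listed, which is the guarantee Sebastian argues
for above.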

