Seb, I like this idea. What do you think, Lewis and Markus? Thanks, that would
help me and my use case.

Sent from my iPhone

> On Jan 28, 2015, at 3:17 PM, Sebastian Nagel <[email protected]> 
> wrote:
> 
> Hi Markus, hi Chris, hi Lewis,
> 
> -1 from me
> 
> A well-documented property is just an invitation to
> disable robots rules. A hidden property is no
> alternative either, because it will soon be "documented"
> in our mailing lists or somewhere on the web.
> 
> And shall we really remove or reformulate
> "Our software obeys the robots.txt exclusion standard"
> on http://nutch.apache.org/bot.html ?
> 
> Since the agent string sent in the HTTP request always contains "/Nutch-x.x"
> (changing it would also require a patch), I wouldn't make
> it too easy to make Nutch ignore robots.txt.
> 
>> As you already stated too, we have properties in Nutch that can
>> turn Nutch into a DDOS crawler with or without robots.txt rule
>> parsing. We set these properties to *sensible defaults*.
> 
> If robots.txt is obeyed, web masters can even prevent this
> by adding a "Crawl-delay" rule to their robots.txt.
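> 
> For example, a webmaster could ask for at least five seconds between
> requests with the following lines in robots.txt:
> 
>   User-agent: *
>   Crawl-delay: 5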
> 
>> (from Chris):
>> but there are also good [security research] uses as well
> 
>> (from Lewis):
>> I've met many web admins recently that want to search and index
>> their entire DNS but do not wish to disable their robots.txt filter
>> in order to do so.
> 
> Ok, these are valid use cases. They have in common that
> the Nutch user owns the crawled servers or is (hopefully)
> explicitly allowed to perform the security research.
> 
> 
> What about an option (or config file) to explicitly exclude
> a list of hosts (or IPs) from robots.txt parsing?
> That would require more effort to configure than a boolean property,
> but because it is explicit, it prevents users from disabling
> robots.txt in general and also guarantees that
> the security research is not accidentally "extended" beyond the intended hosts.
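> 
> As a rough sketch of that idea (the property name
> "robots.exclusion.whitelist" is invented here for illustration and is
> not an existing Nutch setting), the host check could look like this:
> 
>   import java.net.URL;
>   import java.util.Arrays;
>   import java.util.HashSet;
>   import java.util.Set;
> 
>   import org.apache.hadoop.conf.Configuration;
> 
>   public class RobotsWhitelistSketch {
> 
>     private final Set<String> whitelist;
> 
>     public RobotsWhitelistSketch(Configuration conf) {
>       // e.g. robots.exclusion.whitelist=www.example.org,192.0.2.10
>       // (getTrimmedStrings returns an empty array if the property is unset)
>       this.whitelist = new HashSet<String>(Arrays.asList(
>           conf.getTrimmedStrings("robots.exclusion.whitelist")));
>     }
> 
>     /** True if robots.txt parsing should be skipped for this URL's host. */
>     public boolean isWhitelisted(String url) {
>       try {
>         return whitelist.contains(new URL(url).getHost().toLowerCase());
>       } catch (Exception e) {
>         return false; // malformed URL: keep honouring robots.txt
>       }
>     }
>   }
> 
> Everything not listed would keep the current behaviour, so a mistake in
> the list could not silently disable robots.txt for the whole crawl.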
> 
> 
> Cheers,
> Sebastian
> 
> 
>> On 01/28/2015 01:54 AM, Mattmann, Chris A (3980) wrote:
>> Hi Markus,
>> 
>> Thanks for chiming in. I'm reading the below and I see you
>> agree that it should be configurable, but that
>> because Nutch is an Apache project, you dismiss the configuration
>> option. What about it being an Apache project makes it any less
>> ethical to simply have a configurable option, turned off
>> by default, that allows the Robot rules to be disabled?
>> 
>> For full disclosure, I am looking into re-creating DDOS and other
>> attacks as part of some security research, so I have valid use cases
>> here for wanting to do so. You state it's easy to patch Nutch (you
>> are correct, for that matter; it's a 2-line patch to Fetcher.java
>> to disable the RobotRules check). However, how is it any less easy
>> to have a 1-line change that someone would have to apply to *override*
>> the *default* behavior I'm suggesting, with RobotRules being on in
>> nutch-default.xml? So what I'm proposing, concretely, is:
>> 
>> 1. Adding a property like nutch.robots.rules.parser and setting its
>> default value to true, which enables the robots rules parser; putting
>> this property, say, even at the bottom of nutch-default.xml; and
>> stating that improper use of this property in regular whole-web
>> crawls can really hurt your crawling of a site.
>> 
>> 2. Having a check in Fetcher.java for this property: if it's on (the
>> default), behavior is unchanged; if it's off, the robots check is
>> skipped, as in the sketch below.
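>> 
>> As a rough sketch only (the property does not exist in Nutch today; the
>> name and the gating logic below merely illustrate steps 1 and 2, using
>> the crawler-commons parser that Nutch already relies on):
>> 
>>   import org.apache.hadoop.conf.Configuration;
>> 
>>   import crawlercommons.robots.BaseRobotRules;
>>   import crawlercommons.robots.SimpleRobotRulesParser;
>> 
>>   public class RobotsGateSketch {
>> 
>>     /** Decide whether pageUrl may be fetched, given the raw robots.txt. */
>>     public static boolean allowed(Configuration conf, String robotsTxtUrl,
>>         byte[] robotsTxt, String pageUrl, String agentName) {
>>       // Step 1: the proposed property, defaulting to true (rules enforced).
>>       if (!conf.getBoolean("nutch.robots.rules.parser", true)) {
>>         // Step 2: property switched off -> skip the robots.txt check.
>>         return true;
>>       }
>>       // Default path: parse robots.txt and honour it, as Nutch does today.
>>       BaseRobotRules rules = new SimpleRobotRulesParser()
>>           .parseContent(robotsTxtUrl, robotsTxt, "text/plain", agentName);
>>       return rules.isAllowed(pageUrl);
>>     }
>>   }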
>> 
>> The benefit being that you don't encourage people like me (and lots of
>> others that I've talked to) who would like to use Nutch for some
>> security-research crawling to simply go fork it over a 1-line code
>> change. Really? Is that what you want to encourage? The really negative
>> part about that is that it would encourage me to simply use that forked
>> version. I could maintain a patch file and apply that, but it's going
>> to fall out of date with updates to Nutch, and I'm going to have to update
>> that patch file whenever nutch-default.xml changes (and so will other people).
>> 
>> As you already stated too, we have properties in Nutch that can
>> turn Nutch into a DDOS crawler with or without robots.txt rule
>> parsing. We set these properties to *sensible defaults*. I'm proposing
>> a compromise that helps people like me; encourages me to keep using
>> Nutch by keeping things simple; and is no worse than the few other
>> properties we already expose in Nutch configuration that allow it
>> to be turned into a DDOS bot (which, by the way, there are bad uses of,
>> but there are also good [security research] uses as well, to prevent
>> the bad guys).
>> 
>> I appreciate it if you've made it this far, and I hope you will reconsider.
>> 
>> Cheers,
>> Chris
>> 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: [email protected]
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 
>> -----Original Message-----
>> From: Markus Jelsma <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, January 27, 2015 at 3:58 PM
>> To: "[email protected]" <[email protected]>
>> Subject: RE: Option to disable Robots Rule checking
>> 
>>> Chris! This is a firm -1 from me! :)
>>> 
>>> From the point of view of research and of crawling certain pieces of the
>>> web, I strongly agree with you that it should be configurable. But
>>> because Nutch is an Apache project, I dismiss it (arguments available
>>> upon request). We should adhere to some ethics; it is bad enough that we
>>> can DoS a server just by setting some options to a high level. We publish
>>> source code, which leaves the option open to everyone to change it, and I
>>> think the current situation is balanced enough.
>>> Patching it is simple, and I think we should keep it like that :)
>>> 
>>> Cheers,
>>> Markus
>>> 
>>> 
>>> -----Original message-----
>>>> From:Mattmann, Chris A (3980) <[email protected]>
>>>> Sent: Tuesday 27th January 2015 23:46
>>>> To: [email protected]
>>>> Subject: Option to disable Robots Rule checking
>>>> 
>>>> Hey Guys,
>>>> 
>>>> I've recently been made aware of some situations in which
>>>> we are using crawlers like Nutch and are explicitly looking
>>>> not to honor robots.txt (some for research purposes, some for
>>>> other purposes). Right now, of course, this isn't possible since
>>>> honoring it is always required.
>>>> 
>>>> What would you guys think of an optional configuration (turned
>>>> off by default) that allows bypassing of the robots rules?
>>>> 
>>>> Cheers,
>>>> Chris
>>>> 
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Chief Architect
>>>> Instrument Software and Science Data Systems Section (398)
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 168-519, Mailstop: 168-527
>>>> Email: [email protected]
>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Associate Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
