I am happy with this alternative! :)
-----Original message-----
> From: Mattmann, Chris A (3980) <[email protected]>
> Sent: Thursday 29th January 2015 1:21
> To: <[email protected]> <[email protected]>
> Subject: Re: Option to disable Robots Rule checking
>
> Seb, I like this idea. What do you think, Lewis and Markus? Thanks, this
> would help me and my use case.
>
> Sent from my iPhone
>
> > On Jan 28, 2015, at 3:17 PM, Sebastian Nagel <[email protected]>
> > wrote:
> >
> > Hi Markus, hi Chris, hi Lewis,
> >
> > -1 from me
> >
> > A well-documented property is just an invitation to
> > disable robots rules. A hidden property is no
> > alternative either, because it will soon be "documented"
> > on our mailing lists or somewhere on the web.
> >
> > And shall we really remove or reformulate
> > "Our software obeys the robots.txt exclusion standard"
> > on http://nutch.apache.org/bot.html ?
> >
> > Since the agent string sent in the HTTP request always contains "/Nutch-x.x"
> > (changing that would also require a patch), I wouldn't make
> > it too easy to make Nutch ignore robots.txt.
> >
> >> As you already stated too, we have properties in Nutch that can
> >> turn Nutch into a DDoS crawler with or without robots.txt rule
> >> parsing. We set these properties to *sensible defaults*.
> >
> > If the robots.txt is obeyed, web masters can even prevent this
> > by adding a "Crawl-delay" rule to their robots.txt.
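> >
> > For illustration, a robots.txt using this rule (non-standard but
> > widely honored) could look like:
> >
> >   User-agent: *
> >   Crawl-delay: 5
> >
> > which asks compliant crawlers to wait at least 5 seconds between
> > requests to that host.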
> >
> >> (from Chris):
> >> but there are also good [security research] uses as well
> >
> >> (from Lewis):
> >> I've met many web admins recently who want to search and index
> >> their entire DNS but do not wish to disable their robots.txt filter
> >> in order to do so.
> >
> > Ok, these are valid use cases. They have in common that
> > the Nutch user owns the crawled servers or is (hopefully)
> > explicitly allowed to perform the security research.
> >
> >
> > What about an option (or config file) to explicitly exclude
> > a list of hosts (or IPs) from robots.txt parsing?
> > That would require more effort to configure than a boolean property,
> > but because it's explicit, it prevents users from disabling
> > robots.txt in general and also guarantees that
> > the security research is not accidentally "extended".
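> >
> > As a sketch of what I mean (property name and format are made up
> > here, nothing that exists yet), it could be a comma-separated list
> > in nutch-site.xml:
> >
> >   <property>
> >     <name>robot.rules.whitelist</name>
> >     <value>10.20.30.40,staging.example.org</value>
> >     <description>Comma-separated list of hostnames or IP addresses
> >     for which the robots.txt check is skipped. Empty by default,
> >     i.e. robots.txt is obeyed for every host.</description>
> >   </property>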
> >
> >
> > Cheers,
> > Sebastian
> >
> >
> >> On 01/28/2015 01:54 AM, Mattmann, Chris A (3980) wrote:
> >> Hi Markus,
> >>
> >> Thanks for chiming in. I’m reading the below and I see you
> >> agree that it should be configurable, but you state that,
> >> because Nutch is an Apache project, you dismiss the configuration
> >> option. What about being an Apache project makes it any less
> >> ethical to simply have a configurable option, turned off
> >> by default, that allows the Robot rules check to be disabled?
> >>
> >> For full disclosure, I am looking into re-creating DDoS and other
> >> attacks as part of some security research, so I have valid use cases
> >> here for wanting to do so. You state it’s easy to patch Nutch (you
> >> are correct, for that matter; it’s a 2-line patch to Fetcher.java
> >> to disable the RobotRules check). However, how is it any less easy
> >> to have a 1-line configuration change that someone would have to
> >> apply to *override* the *default* behavior I’m suggesting, with
> >> RobotRules on, in nutch-default.xml? So what I’m stating literally in code is:
> >>
> >> 1. Adding a property like nutch.robots.rules.parser and setting its
> >> default value to true, which enables the robots rules parser; putting
> >> this property, say, even at the bottom of nutch-default.xml, and
> >> stating that improper use of this property in regular whole-web
> >> crawls can really hurt your crawling of a site.
> >>
> >> 2. Having a check in Fetcher.java for this property: if it’s on,
> >> the default behavior applies; if it’s off, the RobotRules check is
> >> skipped (see the sketch below).
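> >>
> >> To make that concrete, here is a minimal sketch of both pieces. The
> >> XML follows the nutch-default.xml conventions; the Java is simplified
> >> and only approximates the surrounding Fetcher code:
> >>
> >>   <property>
> >>     <name>nutch.robots.rules.parser</name>
> >>     <value>true</value>
> >>     <description>If true (default), robots.txt rules are fetched and
> >>     obeyed. Setting this to false disables robots rules checking
> >>     entirely; improper use in whole web crawls can really hurt the
> >>     sites you crawl.</description>
> >>   </property>
> >>
> >>   // In Fetcher.java, guarding the existing robots check:
> >>   if (conf.getBoolean("nutch.robots.rules.parser", true)) {
> >>     BaseRobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
> >>     if (!rules.isAllowed(fit.u.toString())) {
> >>       // ... existing "blocked by robots.txt" handling ...
> >>       continue;
> >>     }
> >>   }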
> >>
> >> The benefit being you don’t encourage people like me (and lots of
> >> others I’ve talked to) who would like to use Nutch for some
> >> security-research crawling to simply go fork it for a 1-line code
> >> change. Really? Is that what you want to encourage? The really negative
> >> part about that is that it will encourage me to simply use that forked
> >> version. I could maintain a patch file and apply that, but it’s going
> >> to fall out of date with updates to Nutch; I’m going to have to update that
> >> patch file whenever nutch-default.xml changes (and so will other people, etc.).
> >>
> >> As you already stated too, we have properties in Nutch that can
> >> turn Nutch into a DDoS crawler with or without robots.txt rule
> >> parsing. We set these properties to *sensible defaults*. I’m proposing
> >> a compromise that helps people like me; encourages me to keep using
> >> Nutch through simplification; and is no worse than the few other
> >> properties we already expose in Nutch configuration that allow it
> >> to be turned into a DDoS bot (which, by the way, has bad uses,
> >> but there are also good [security research] uses as well, to prevent
> >> the bad guys).
> >>
> >> I appreciate it if you’ve made it this far, and I hope you will reconsider.
> >>
> >> Cheers,
> >> Chris
> >>
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Chris Mattmann, Ph.D.
> >> Chief Architect
> >> Instrument Software and Science Data Systems Section (398)
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 168-519, Mailstop: 168-527
> >> Email: [email protected]
> >> WWW: http://sunset.usc.edu/~mattmann/
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Adjunct Associate Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>
> >> -----Original Message-----
> >> From: Markus Jelsma <[email protected]>
> >> Reply-To: "[email protected]" <[email protected]>
> >> Date: Tuesday, January 27, 2015 at 3:58 PM
> >> To: "[email protected]" <[email protected]>
> >> Subject: RE: Option to disable Robots Rule checking
> >>
> >>> Chris! This is a firm -1 from me! :)
> >>>
> >>> From the point of view of research and crawling certain pieces of the
> >>> web, I strongly agree with you that it should be configurable. But
> >>> because Nutch is an Apache project, I dismiss it (arguments available
> >>> upon request). We should adhere to some ethics; it is bad enough that we
> >>> can DoS a server just by setting some options too high. We publish
> >>> source code, which leaves everyone the option to change it, and I
> >>> think the current situation is balanced enough.
> >>> Patching it is simple; I think we should keep it like that :)
> >>>
> >>> Cheers,
> >>> Markus
> >>>
> >>>
> >>> -----Original message-----
> >>>> From: Mattmann, Chris A (3980) <[email protected]>
> >>>> Sent: Tuesday 27th January 2015 23:46
> >>>> To: [email protected]
> >>>> Subject: Option to disable Robots Rule checking
> >>>>
> >>>> Hey Guys,
> >>>>
> >>>> I’ve recently been made aware of some situations in which
> >>>> we are using crawlers like Nutch and are explicitly looking
> >>>> not to honor robots.txt (some for research purposes; some for
> >>>> other purposes). Right now, of course, this isn’t possible, since
> >>>> honoring robots.txt is always required.
> >>>>
> >>>> What would you guys think of an optional configuration (turned
> >>>> off by default) that allows bypassing of Robot rules?
> >>>>
> >>>> Cheers,
> >>>> Chris
> >>>>
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> Chris Mattmann, Ph.D.
> >>>> Chief Architect
> >>>> Instrument Software and Science Data Systems Section (398)
> >>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>>> Office: 168-519, Mailstop: 168-527
> >>>> Email: [email protected]
> >>>> WWW: http://sunset.usc.edu/~mattmann/
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> Adjunct Associate Professor, Computer Science Department
> >>>> University of Southern California, Los Angeles, CA 90089 USA
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
>