Hi Markus,

Thanks for chiming in. Reading your reply below, I see that you
agree it should be configurable, but that because Nutch is an
Apache project you dismiss the configuration option. What about
being an Apache project makes it any less ethical to offer a
configuration option, turned off by default, that allows the
robot rules check to be disabled?

For full disclosure, I am looking into re-creating DDoS and other
attacks as part of some security research, so I have valid use cases
for wanting to do this. You state that it's easy to patch Nutch, and
you're correct: it's a two-line patch to Fetcher.java to disable the
robot rules check. However, how is requiring a one-line configuration
change to *override* the *default* behavior I'm suggesting (robot
rules checking on by default in nutch-default.xml) any less simple
than requiring a patch? Concretely, what I'm proposing is:

1. Add a property, something like nutch.robots.rules.parser, with a
default value of true, which keeps the robot rules parser enabled.
Put the property at the bottom of nutch-default.xml if you like, with
a description stating that improper use of it in regular situations
such as whole-web crawls can really hurt your crawling of a site.

2. Add a check in Fetcher.java for this property: if it's on (the
default), keep the current behavior; if it's off, skip the robots
check. A rough sketch follows below.
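
To make this concrete, here is a rough sketch. The property name is the
one I suggested above; the XML and Java are illustrative only (the
variable names and method signatures are from memory, not the exact
Fetcher code), but they show how small the change is:

  <!-- nutch-default.xml -->
  <property>
    <name>nutch.robots.rules.parser</name>
    <value>true</value>
    <description>If true (the default), the fetcher honors robots.txt
    rules exactly as it does today. Setting this to false skips the
    robot rules check entirely and should only be done in controlled
    research environments; using it on regular whole-web crawls can
    really hurt your crawling of a site.</description>
  </property>

  // Fetcher.java (hypothetical helper, not the real method names)
  private boolean isAllowedByRobots(Text url, CrawlDatum datum)
      throws ProtocolException {
    // New property: default true preserves today's behavior.
    if (!conf.getBoolean("nutch.robots.rules.parser", true)) {
      return true; // check disabled: treat every URL as allowed
    }
    // Otherwise do exactly what we do now.
    BaseRobotRules rules = protocol.getRobotRules(url, datum);
    return rules.isAllowed(url.toString());
  }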

The benefit is that you don't push people like me (and plenty of
others I've talked to) who want to use Nutch for security-research
crawling to simply fork it over a one-line code change. Really? Is
that what you want to encourage? The really negative part is that it
would push me to keep using that forked version. I could maintain a
patch file and apply it instead, but it will fall out of date with
updates to Nutch, and I'll have to keep updating it whenever
nutch-default.xml changes (and so will other people).

As you already stated, we have properties in Nutch that can turn it
into a DDoS crawler with or without robots.txt rule parsing. We set
those properties to *sensible defaults*. I'm proposing the same kind
of compromise: it helps people like me, it keeps me using stock Nutch
by keeping things simple, and it is no worse than the handful of
properties we already expose in Nutch configuration that allow it to
be turned into a DDoS bot (which, by the way, have bad uses, but also
good security-research uses that help stop the bad guys).
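
For example, if I remember the shipped defaults right, somebody can
already do this in nutch-site.xml today without touching a line of
code:

  <property>
    <name>fetcher.server.delay</name>
    <value>0.0</value>   <!-- shipped default is a polite 5 seconds -->
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>50</value>    <!-- shipped default is 1 thread per host -->
  </property>

and hammer a single host as hard as their bandwidth allows, robots.txt
parsing or not. We trust sensible defaults and the description text to
keep people honest there; the property I'm proposing is no different.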

I appreciate you reading this far and hope you will reconsider.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Markus Jelsma <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, January 27, 2015 at 3:58 PM
To: "[email protected]" <[email protected]>
Subject: RE: Option to disable Robots Rule checking

>Chris! This is a firm -1 from me! :)
>
>From the point of view of research and crawling certain pieces of the
>web, and i strongly agree with you that it should be configurable. But
>because Nutch being an Apache project, i dismiss it (arguments available
>upon request). We should adhere to some ethics, it is bad enough that we
>can just DoS a server by setting some options to a high level. We publish
>source code, it leaves the option open to everyone to change it, and i
>think the current situation is balanced enough.
>Patching it is simple, i think we should keep it like that :)
>
>Cheers,
>Markus
> 
> 
>-----Original message-----
>> From:Mattmann, Chris A (3980) <[email protected]>
>> Sent: Tuesday 27th January 2015 23:46
>> To: [email protected]
>> Subject: Option to disable Robots Rule checking
>> 
>> Hey Guys,
>> 
>> I’ve recently been made aware of some situations in which
>> we are using crawlers like Nutch and we explicitly are looking
>> not to honor robots.txt (some for research purposes; some for
>> other purposes). Right now, of course, this isn’t possible since
>> it’s always explicitly required.
>> 
>> What would you guys think of as an optional configuration (turned
>> off by default) that allows bypassing of Robot rules?
>> 
>> Cheers,
>> Chris
>> 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: [email protected]
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 
>> 
>> 
>> 
>> 
