+1, please commit! Thanks, Seb.

> On Apr 17, 2015, at 4:15 PM, Sebastian Nagel (JIRA) <[email protected]> wrote:
> 
> 
>     [ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Sebastian Nagel updated NUTCH-1927:
> -----------------------------------
>    Attachment: test_NUTCH-1927.2015-04-17.txt
>                NUTCH-1927.2015-04-17.patch
> 
> Patch to log more verbosely; here is the output of a test on "localhost":
> {noformat}
> 2015-04-17 21:58:03,902 INFO  protocol.RobotRulesParser - Whitelisted hosts: [localhost]
> ...
> 2015-04-17 21:58:03,906 INFO  api.HttpRobotRulesParser - Whitelisted host found for: http://localhost/foo/index.html
> 2015-04-17 21:58:03,906 INFO  api.HttpRobotRulesParser - Ignoring robots.txt for all URLs from whitelisted host: localhost
> {noformat}
> 
> RobotRulesParser now implements Tool to simplify testing: properties can be 
> passed via "-Dprop=val"; see the attached log from the test session.
> 
>> Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
>> ---------------------------------------------------------------------------
>> 
>>                Key: NUTCH-1927
>>                URL: https://issues.apache.org/jira/browse/NUTCH-1927
>>            Project: Nutch
>>         Issue Type: New Feature
>>         Components: fetcher
>>           Reporter: Chris A. Mattmann
>>           Assignee: Chris A. Mattmann
>>             Labels: available, patch
>>            Fix For: 1.10
>> 
>>        Attachments: NUTCH-1927.2015-04-16.patch, 
>> NUTCH-1927.2015-04-17.patch, NUTCH-1927.Mattmann.041115.patch.txt, 
>> NUTCH-1927.Mattmann.041215.patch.txt, NUTCH-1927.Mattmann.041415.patch.txt, 
>> test_NUTCH-1927.2015-04-17.txt
>> 
>> 
>> Based on discussion on the dev list, and to support some valid security 
>> research use cases for Nutch (DDoS, DNS, and other testing), I am going to 
>> create a patch that allows a whitelist:
>> {code:xml}
>> <property>
>>  <name>robot.rules.whitelist</name>
>>  <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
>>  <description>Comma separated list of hostnames or IP addresses to ignore 
>> robot rules parsing for.
>>  </description>
>> </property>
>> {code}
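
For illustration only, here is a minimal sketch of how a comma-separated value like the one above could be parsed and checked against a URL's host. It is not the code from the attached patches: the class name RobotsWhitelistSketch and both method names are hypothetical, and it assumes exact, case-insensitive host matching with no DNS resolution between hostnames and IP addresses.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch; not the implementation from the NUTCH-1927 patches. */
public class RobotsWhitelistSketch {

  /** Splits a comma-separated property value into a set of lower-cased entries. */
  public static Set<String> parseWhitelist(String value) {
    Set<String> hosts = new LinkedHashSet<>();
    if (value == null) {
      return hosts;
    }
    for (String entry : value.split(",")) {
      String host = entry.trim().toLowerCase(Locale.ROOT);
      if (!host.isEmpty()) {
        hosts.add(host); // skip empty entries such as "a,,b"
      }
    }
    return hosts;
  }

  /**
   * Returns true if the URL's host literally appears in the whitelist.
   * Assumption: exact match only; a hostname is not resolved to its IP.
   * URI.create throws IllegalArgumentException for a malformed URL.
   */
  public static boolean isWhitelisted(Set<String> whitelist, String url) {
    String host = URI.create(url).getHost();
    return host != null && whitelist.contains(host.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    Set<String> whitelist =
        parseWhitelist("132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov");
    System.out.println(isWhitelisted(whitelist, "http://hostname.apache.org/foo/index.html"));
    System.out.println(isWhitelisted(whitelist, "http://example.org/"));
  }
}
```

With exact matching, a host listed by name is not matched when the URL uses its IP address (and vice versa), so both forms would have to be listed if both are crawled.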
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
