[
https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492506#comment-14492506
]
Chris A. Mattmann commented on NUTCH-1927:
------------------------------------------
Thanks Lewis, and Seb, got it. Will fix the formatting. Seb:
bq. http.robot.rules.whitelist should be empty per default
Yep fixed this (and the surrounding code) in my latest patch. Will upload soon.
bq. the description says "hostnames or IP addresses" - is IP address white
listing supported?
Yep, if a URL uses an IP, this would work fine. However, it would not match a
URL whose hostname merely resolves to a whitelisted IP, since we aren't
resolving on the fly. Probably shouldn't, I guess.
bq. instead of repeatedly splitting whitelisted hosts at ',' use
conf.getStrings(...) to initially fill the white list
ACK, will do.
bq. also the white list is a set and should be stored as such to avoid
iterating over the list as in isWhiteListed()
Meaning replace the iteration in isWhiteListed() with Set.contains() or something?
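A minimal sketch of what the two suggestions above could look like together, assuming the whitelist entries arrive as a String[] (the shape that Hadoop's conf.getStrings(...) returns, with null when the property is unset). The class name RobotRulesWhitelist and its methods are hypothetical and are not taken from the attached patches. Matching is a plain string comparison, so an IP entry only matches URLs that use that IP literally.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not the code from the NUTCH-1927 patch.
public class RobotRulesWhitelist {
    private final Set<String> hosts;

    // `entries` stands in for the result of conf.getStrings(...);
    // null is treated as an empty whitelist.
    public RobotRulesWhitelist(String[] entries) {
        Set<String> parsed = new LinkedHashSet<>();
        if (entries != null) {
            for (String entry : entries) {
                // Normalize once at load time instead of on every lookup.
                String host = entry.trim().toLowerCase(Locale.ROOT);
                if (!host.isEmpty()) {
                    parsed.add(host);
                }
            }
        }
        this.hosts = Collections.unmodifiableSet(parsed);
    }

    // Constant-time membership test; no iteration, no DNS resolution.
    public boolean isWhiteListed(String host) {
        return host != null && hosts.contains(host.toLowerCase(Locale.ROOT));
    }

    public int size() {
        return hosts.size();
    }
}
```

Building the set once (for example when the configuration is set) avoids re-splitting the property value for every URL.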
bq. Why is it necessary to create in Fetcher for every URL a new
WhiteListRobotRules object? Wouldn't it be simpler (and more efficient) to use
the existing cache in RobotRulesParser and just put a reference to a singleton
white list rules object if the host is element of the white list?
Good idea, will do so. New patch coming soon! Are you at ApacheCon?
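To illustrate the singleton idea, here is a rough sketch under stated assumptions: the Rules interface, the ALLOW_ALL constant, and the fetchAndParse placeholder are all hypothetical stand-ins, not the actual RobotRulesParser API. It only shows a per-host cache that stores a reference to one shared allow-all object for whitelisted hosts.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only; names and types are assumptions, not Nutch classes.
public class RobotRulesCacheSketch {
    // Hypothetical stand-in for a parsed robots.txt rule set.
    public interface Rules {
        boolean isAllowed(String url);
    }

    // One shared instance that allows everything, reused for all whitelisted hosts.
    public static final Rules ALLOW_ALL = url -> true;

    private final Set<String> whitelist;
    private final Map<String, Rules> cache = new ConcurrentHashMap<>();

    public RobotRulesCacheSketch(Set<String> whitelist) {
        this.whitelist = new HashSet<>(whitelist);
    }

    // Looks up rules per host; whitelisted hosts get the singleton
    // and never trigger a robots.txt fetch.
    public Rules getRules(String host) {
        return cache.computeIfAbsent(host,
            h -> whitelist.contains(h) ? ALLOW_ALL : fetchAndParse(h));
    }

    // Placeholder for the real robots.txt fetch and parse step (assumption):
    // here it simply disallows anything under /private/.
    Rules fetchAndParse(String host) {
        return url -> !url.contains("/private/");
    }
}
```

With this shape the fetcher would ask the cache for rules as usual and would not need to construct a new object per URL.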
> Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
> ---------------------------------------------------------------------------
>
> Key: NUTCH-1927
> URL: https://issues.apache.org/jira/browse/NUTCH-1927
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Labels: available, patch
> Fix For: 1.10
>
> Attachments: NUTCH-1927.Mattmann.041115.patch.txt,
> NUTCH-1927.Mattmann.041215.patch.txt
>
>
> Based on discussion on the dev list, to use Nutch for some valid security
> research use cases (DDoS, DNS, and other testing), I am going to create a patch
> that allows a whitelist:
> {code:xml}
> <property>
> <name>robot.rules.whitelist</name>
> <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
> <description>Comma separated list of hostnames or IP addresses to ignore
> robot rules parsing for.
> </description>
> </property>
> {code}