[ https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553264#comment-14553264 ]
Chris A. Mattmann commented on NUTCH-1995:
------------------------------------------
Hey Seb, yeah, I don't think we should support a bare *, for sure. At the same
time, turning off robots.txt before was as easy as literally commenting out two
lines and typing ant runtime. We shouldn't fool ourselves that we are
preventing anything, even with whitelisting: the same can still be done (and I
know of many, many situations, valid security use cases, in which it is). Like
I also said before, all we are doing in those cases is encouraging people to
fork, build their own crawlers, and call the result != Nutch. I don't think we
want that. I personally don't want that. Also, I'm trying to encourage more and
more people in that domain to use Nutch, whereas so far they've either built
their own crawler, modified Nutch with a 2-line patch and rebuilt it under
another name, and/or used Scrapy. None of those are ideal solutions IMO.
So, back to the point: let's check for a bare "*" and not support that, but
let's support the other patterns (*.blah.*, */* blah, whatever). Is that a
fair compromise?
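To make the compromise concrete, here's a minimal sketch in Java (not Nutch's
actual implementation; the class {{WhitelistPatternSketch}} and method
{{compileEntry}} are hypothetical names) of how a whitelist entry could be
handled so a bare "*" is rejected while patterns like *.sample.com still match:
{code:java}
import java.util.regex.Pattern;

public class WhitelistPatternSketch {

  /** Compiles one whitelist entry; refuses the match-everything "*". */
  static Pattern compileEntry(String entry) {
    String trimmed = entry.trim();
    if (trimmed.equals("*")) {
      throw new IllegalArgumentException(
          "A bare \"*\" would whitelist every host and is not supported");
    }
    // Quote the entry so dots match literally, then splice ".*" in
    // wherever the wildcard character appears.
    String regex = Pattern.quote(trimmed).replace("*", "\\E.*\\Q");
    return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
  }

  public static void main(String[] args) {
    Pattern p = compileEntry("*.sample.com");
    System.out.println(p.matcher("www.sample.com").matches()); // true
    System.out.println(p.matcher("sample.org").matches());     // false
  }
}
{code}
The key point is that rejection happens at configuration time, so a
match-everything whitelist fails loudly instead of silently disabling
robots.txt for the whole crawl.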
> Add support for wildcard to http.robot.rules.whitelist
> ------------------------------------------------------
>
> Key: NUTCH-1995
> URL: https://issues.apache.org/jira/browse/NUTCH-1995
> Project: Nutch
> Issue Type: Improvement
> Components: robots
> Affects Versions: 1.10
> Reporter: Giuseppe Totaro
> Assignee: Chris A. Mattmann
> Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-1995.patch
>
>
> The {{http.robot.rules.whitelist}}
> ([NUTCH-1927|https://issues.apache.org/jira/browse/NUTCH-1927]) configuration
> parameter allows specifying a comma-separated list of hostnames or IP
> addresses for which robots.txt rules are ignored.
> Adding wildcard support to {{http.robot.rules.whitelist}} could be very
> useful and would simplify the configuration, for example when many
> hostnames/addresses need to be listed. Here is an example:
> {noformat}
> <property>
> <name>http.robot.rules.whitelist</name>
> <value>*.sample.com</value>
> <description>Comma separated list of hostnames or IP addresses to ignore
> robot rules parsing for. Use with care and only if you are explicitly
> allowed by the site owner to ignore the site's robots.txt!
> </description>
> </property>
> {noformat}