[
https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Giuseppe Totaro updated NUTCH-1995:
-----------------------------------
Description:
The {{http.robot.rules.whitelist}}
([NUTCH-1927|https://issues.apache.org/jira/browse/NUTCH-1927]) configuration
parameter specifies a comma-separated list of hostnames or IP addresses for
which robot rules parsing is ignored.
Adding wildcard support to {{http.robot.rules.whitelist}} would simplify the
configuration, for example when many hostnames or addresses need to be listed.
Here is an example:
{noformat}
<property>
<name>http.robot.rules.whitelist</name>
<value>*.sample.com</value>
<description>Comma separated list of hostnames or IP addresses to ignore
robot rules parsing for. Use with care and only if you are explicitly
allowed by the site owner to ignore the site's robots.txt!
</description>
</property>
{noformat}
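The matching that such a wildcard entry implies can be sketched as follows. This is only an illustration under stated assumptions, not the Nutch implementation: the class {{WildcardWhitelist}} and its methods are hypothetical names, and it assumes that {{*}} matches any run of characters and that comparison is case-insensitive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: matches hostnames or IP addresses
// against a comma-separated whitelist whose entries may contain '*'.
final class WildcardWhitelist {
    private final List<Pattern> patterns = new ArrayList<>();

    // Parses a value such as "*.sample.com, 192.168.0.*".
    WildcardWhitelist(String commaSeparated) {
        for (String entry : commaSeparated.split(",")) {
            String trimmed = entry.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                patterns.add(toPattern(trimmed));
            }
        }
    }

    // Converts a glob into a regex: literal pieces are quoted, and each '*'
    // becomes ".*" (assumption: '*' may match any run of characters).
    static Pattern toPattern(String glob) {
        StringBuilder regex = new StringBuilder();
        String[] parts = glob.split("\\*", -1);
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            if (!parts[i].isEmpty()) {
                regex.append(Pattern.quote(parts[i]));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Returns true if the whole host string matches any whitelist entry.
    boolean isWhitelisted(String host) {
        String normalized = host.toLowerCase(Locale.ROOT);
        for (Pattern p : patterns) {
            if (p.matcher(normalized).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, {{*.sample.com}} covers {{www.sample.com}} but not the bare {{sample.com}}, because the literal dot after the wildcard must be present. Whether the bare domain should also match is a design decision the issue leaves open.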
was:
The {{http.robot.rules.whitelist}} configuration parameter allows to specify a
comma separated list of hostnames or IP addresses to ignore robot rules parsing
for.
Adding support for wildcard in {{http.robot.rules.whitelist}} could be very
useful and simplify the configuration, for example, if we need to give many
hostnames/addresses. Here is an example:
{noformat}
<name>http.robot.rules.whitelist</name>
<value>*.sample.com</value>
<description>Comma separated list of hostnames or IP addresses to ignore
robot rules parsing for. Use with care and only if you are explicitly
allowed by the site owner to ignore the site's robots.txt!
</description>
</property>
{noformat}
> Add support for wildcard to http.robot.rules.whitelist
> ------------------------------------------------------
>
> Key: NUTCH-1995
> URL: https://issues.apache.org/jira/browse/NUTCH-1995
> Project: Nutch
> Issue Type: Improvement
> Components: robots
> Affects Versions: 1.10
> Reporter: Giuseppe Totaro
>
> The {{http.robot.rules.whitelist}}
> ([NUTCH-1927|https://issues.apache.org/jira/browse/NUTCH-1927]) configuration
> parameter specifies a comma-separated list of hostnames or IP addresses for
> which robot rules parsing is ignored.
> Adding wildcard support to {{http.robot.rules.whitelist}} would simplify the
> configuration, for example when many hostnames or addresses need to be listed.
> Here is an example:
> {noformat}
> <property>
> <name>http.robot.rules.whitelist</name>
> <value>*.sample.com</value>
> <description>Comma separated list of hostnames or IP addresses to ignore
> robot rules parsing for. Use with care and only if you are explicitly
> allowed by the site owner to ignore the site's robots.txt!
> </description>
> </property>
> {noformat}