[ 
https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated NUTCH-1995:
-----------------------------------
    Description: 
The {{http.robot.rules.whitelist}} 
([NUTCH-1927|https://issues.apache.org/jira/browse/NUTCH-1927]) configuration 
parameter takes a comma-separated list of hostnames or IP addresses for which 
robot rules parsing is skipped.
Adding wildcard support to {{http.robot.rules.whitelist}} would simplify the 
configuration, for example when many hostnames or addresses have to be listed. 
Here is an example:
{noformat}
<property>
  <name>http.robot.rules.whitelist</name>
  <value>*.sample.com</value>
  <description>Comma separated list of hostnames or IP addresses to ignore 
  robot rules parsing for. Use with care and only if you are explicitly
  allowed by the site owner to ignore the site's robots.txt!
  </description>
</property>
{noformat}
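
The matching behaviour this request implies could look roughly like the Java sketch below. It is only an illustration, not Nutch code: the class {{HostWhitelist}} and its method {{isWhitelisted}} are hypothetical names, and the semantics are an assumption, namely that a leading {{*.}} matches any subdomain of the given domain but not the bare domain itself, and that entries without a wildcard are compared exactly and case-insensitively.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of Nutch. */
class HostWhitelist {
  // Entries without a wildcard, compared exactly.
  private final List<String> exactHosts = new ArrayList<>();
  // Wildcard entries stored as suffixes with the leading dot, e.g. ".sample.com".
  private final List<String> suffixes = new ArrayList<>();

  /** Parses a comma-separated value such as "*.sample.com, 10.0.0.1". */
  HostWhitelist(String configValue) {
    for (String entry : configValue.split(",")) {
      String e = entry.trim().toLowerCase(Locale.ROOT);
      if (e.isEmpty()) {
        continue; // ignore empty entries such as a trailing comma
      }
      if (e.startsWith("*.")) {
        // Keep the dot so "*.sample.com" cannot match "evilsample.com".
        suffixes.add(e.substring(1));
      } else {
        exactHosts.add(e);
      }
    }
  }

  /** Returns true if the host matches an exact entry or a wildcard suffix. */
  boolean isWhitelisted(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    if (exactHosts.contains(h)) {
      return true;
    }
    for (String suffix : suffixes) {
      // Assumption: at least one label must precede the suffix, so the
      // bare domain "sample.com" is NOT matched by "*.sample.com".
      if (h.length() > suffix.length() && h.endsWith(suffix)) {
        return true;
      }
    }
    return false;
  }
}
```

With the value {{*.sample.com}}, this sketch would accept {{www.sample.com}} and {{a.b.sample.com}} and reject {{sample.com}} and {{evilsample.com}}. Whether the bare domain should also match is a design choice the issue leaves open.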

  was:
The {{http.robot.rules.whitelist}} configuration parameter takes a 
comma-separated list of hostnames or IP addresses for which robot rules parsing 
is skipped.
Adding wildcard support to {{http.robot.rules.whitelist}} would simplify the 
configuration, for example when many hostnames or addresses have to be listed. 
Here is an example:
{noformat}
<property>
  <name>http.robot.rules.whitelist</name>
  <value>*.sample.com</value>
  <description>Comma separated list of hostnames or IP addresses to ignore 
  robot rules parsing for. Use with care and only if you are explicitly
  allowed by the site owner to ignore the site's robots.txt!
  </description>
</property>
{noformat}


> Add support for wildcard to http.robot.rules.whitelist
> ------------------------------------------------------
>
>                 Key: NUTCH-1995
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1995
>             Project: Nutch
>          Issue Type: Improvement
>          Components: robots
>    Affects Versions: 1.10
>            Reporter: Giuseppe Totaro



