[
https://issues.apache.org/jira/browse/NUTCH-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964632#comment-13964632
]
Sebastian Nagel commented on NUTCH-1752:
----------------------------------------
Yep: Apache httpd and Tomcat on the same host but on different ports (80 vs. 8080).
The Tomcat content is not to be crawled ("Disallow: /" in its robots.txt) but is
linked from content served by httpd. No question, using a VirtualHost in this case
would be more transparent (and the right robots.txt, or rather both of them, would
have been fetched).
> cache robots.txt rules per protocol:host:port
> ---------------------------------------------
>
> Key: NUTCH-1752
> URL: https://issues.apache.org/jira/browse/NUTCH-1752
> Project: Nutch
> Issue Type: Bug
> Components: protocol
> Affects Versions: 1.8, 2.2.1
> Reporter: Sebastian Nagel
> Fix For: 2.3, 1.9
>
> Attachments: NUTCH-1752-v1.patch
>
>
> HttpRobotRulesParser caches rules from {{robots.txt}} per "protocol:host"
> (before NUTCH-1031 caching was per "host" only). The caching should be per
> "protocol:host:port", since a request to a different port may deliver a
> different {{robots.txt}} (see the sketch after the list below).
> Applying robots.txt rules per combination of protocol, host, and port is
> common practice.
> The [Norobots RFC draft (1996)|http://www.robotstxt.org/norobots-rfc.txt] does not
> mention this explicitly (it could be derived from the examples), but others do:
> * [Wikipedia|http://en.wikipedia.org/wiki/Robots.txt]: "each protocol and
> port needs its own robots.txt file"
> * [Google
> webmasters|https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt]:
> "The directives listed in the robots.txt file apply only to the host,
> protocol and port number where the file is hosted."
--
This message was sent by Atlassian JIRA
(v6.2#6252)