[ https://issues.apache.org/jira/browse/NUTCH-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel reopened NUTCH-2646:
------------------------------------

> CLONE - Caching of redirected robots.txt may overwrite correct robots.txt rules
> -------------------------------------------------------------------------------
>
>                 Key: NUTCH-2646
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2646
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, robots
>    Affects Versions: 2.3.1, 1.14
>            Reporter: Chang Fan
>            Assignee: Sebastian Nagel
>            Priority: Critical
>
> Redirected robots.txt rules are also cached for the target host. As a
> result, the correct robots.txt rules of the target host may never be
> fetched. E.g., http://wyomingtheband.com/robots.txt redirects to
> https://www.facebook.com/wyomingtheband/robots.txt. Because fetching fails
> with a 404, bots are allowed to crawl wyomingtheband.com. These permissive
> rules are erroneously also cached for the redirect target host
> www.facebook.com, whose own [robots.txt|https://www.facebook.com/robots.txt]
> rules are clear and do not allow crawling.
> Nutch should cache redirected robots.txt rules for the target host only if
> the path part (in doubt, including the query) of the redirect target URL is
> exactly {{/robots.txt}}.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
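
The condition proposed in the issue description could be sketched roughly as follows. This is only an illustration, not the actual Nutch patch: the class `RobotsRedirectCheck` and the method `mayCacheForTargetHost` are hypothetical names, and the sketch assumes the strict reading of "in doubt, including the query", i.e. any query string on the redirect target disqualifies it.

```java
import java.net.URI;

// Hypothetical helper, not part of the Nutch API.
final class RobotsRedirectCheck {

  private RobotsRedirectCheck() {
  }

  /**
   * Returns true only if the redirect target addresses the robots.txt at
   * the root of its host, i.e. the path is exactly "/robots.txt" and no
   * query string is present (assumption: strict interpretation).
   */
  static boolean mayCacheForTargetHost(URI redirectTarget) {
    String path = redirectTarget.getRawPath();
    String query = redirectTarget.getRawQuery();
    return "/robots.txt".equals(path) && query == null;
  }
}
```

With the URLs from the description, `mayCacheForTargetHost` returns false for https://www.facebook.com/wyomingtheband/robots.txt, so the 404 result would be cached for wyomingtheband.com only, and true for https://www.facebook.com/robots.txt.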