Support for the x-robots-tag HTTP Header
----------------------------------------

                 Key: NUTCH-1257
                 URL: https://issues.apache.org/jira/browse/NUTCH-1257
             Project: Nutch
          Issue Type: New Feature
          Components: fetcher
            Reporter: Mike


Google and Bing both currently support the x-robots-tag HTTP header. This is 
important because they have a policy of not *crawling* links that are 
disallowed in a robots.txt file, and not *indexing* links that are set to 
noindex. When a page is indexed but not crawled, Google and Bing will show the 
page in their results, but it will lack a snippet (since they didn't crawl it, 
they can't generate one). 

As a result, robots.txt alone only stops crawling, not indexing. The only way 
to keep a page out of the Google and Bing indexes is to use the robots meta tag 
in HTML pages and the x-robots-tag header for other MIME types.
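
To make the request concrete, here is a rough sketch of the check a crawler 
could run on response headers before indexing a document. It is written in 
Python only for brevity and is not Nutch code. The function names, the default 
agent name "nutch", and the directive list are my own assumptions for 
illustration. It handles plain values such as "noindex, nofollow" and values 
scoped to one crawler such as "googlebot: noindex".

```python
# Illustrative sketch only; not taken from Nutch or any search engine.
# Directive names the sketch recognises (an assumed, non-exhaustive list).
KNOWN_DIRECTIVES = {"all", "noindex", "nofollow", "none", "noarchive", "nosnippet"}


def robots_directives(header_values, agent="nutch"):
    """Collect the x-robots-tag directives that apply to `agent`.

    `header_values` is a list of raw header values, one per x-robots-tag
    header line in the response.
    """
    found = set()
    for value in header_values:
        prefix, sep, rest = value.partition(":")
        name = prefix.strip().lower()
        # A leading "name:" that is not itself a directive is treated as a
        # crawler name that scopes the rest of the value.
        if sep and name not in KNOWN_DIRECTIVES | {"unavailable_after"}:
            if name != agent.lower():
                continue  # addressed to a different crawler
            value = rest
        for token in value.split(","):
            token = token.strip().lower()
            if token in KNOWN_DIRECTIVES:
                found.add(token)
    # "none" is shorthand for "noindex, nofollow".
    if "none" in found:
        found |= {"noindex", "nofollow"}
    return found


def may_index(header_values, agent="nutch"):
    """Return False when the headers ask `agent` not to index the document."""
    return "noindex" not in robots_directives(header_values, agent)
```

With this, a PDF served with "x-robots-tag: noindex" would be fetched but kept 
out of the index, which is the behaviour the robots meta tag already gives for 
HTML pages.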

As a site owner who needs to keep specific pages private, I *cannot* trust 
robots.txt to keep my pages out of Google and Bing, so I have to use the two 
robots standards. Since Nutch doesn't support the HTTP header, I have to block 
it from crawling ALL non-HTML pages on my site.

This is not an ideal state of affairs, and it would be great if Nutch supported 
the x-robots-tag HTTP header.

I've done more research on this topic on my blog:
 - http://michaeljaylissner.com/blog/support-for-x-robots-tag-http-header-and-robots-HTML-meta-tag
 - http://michaeljaylissner.com/blog/respecting-privacy-while-providing-hundreds-of-thousands-of-public-documents

