[ 
https://issues.apache.org/jira/browse/NUTCH-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083114#comment-15083114
 ] 

Markus Jelsma commented on NUTCH-1257:
--------------------------------------

Hmm, there is no patch, but I remember having this support in our older 
customized Nutch builds. I'll see if I can find it again.

> Support for the x-robots-tag HTTP Header
> ----------------------------------------
>
>                 Key: NUTCH-1257
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1257
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Mike
>            Assignee: Markus Jelsma
>              Labels: http, privacy, robots
>             Fix For: 2.4
>
>
> Google and Bing both currently support the x-robots-tag HTTP header. This is 
> important because they have a policy of not *crawling* URLs that are 
> disallowed in robots.txt, and not *indexing* URLs that are set to noindex. If 
> a page is indexed but not crawled, Google and Bing will show the page in 
> their results, but it will lack a snippet (since they didn't crawl it, they 
> can't generate one).
> As a result, the only way to keep a page out of the Google and Bing indexes 
> is to use the robots meta tag in HTML pages and the x-robots-tag header for 
> other MIME types.
> As a site owner that needs to keep specific pages private, I *cannot* trust 
> robots.txt to keep my pages out of Google and Bing, and I have to use the two 
> robots standards. Since Nutch doesn't support the HTTP header, I have to 
> block it from crawling ALL non-HTML pages on my site.
> This is not an ideal state of affairs, and it would be great if Nutch 
> supported the x-robots-tag HTTP header.
> I've done more research on this topic on my blog:
>  - http://michaeljaylissner.com/blog/support-for-x-robots-tag-http-header-and-robots-HTML-meta-tag
>  - http://michaeljaylissner.com/blog/respecting-privacy-while-providing-hundreds-of-thousands-of-public-documents
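
For illustration only, the header handling requested in the issue could look roughly like the sketch below. It is written in Python for brevity and is not Nutch code (Nutch is Java, and no patch is attached to this issue). The function names, the default agent name "nutch", and the set of valued directives are assumptions made for the example. It handles the plain form ("noindex, nofollow") and the agent-scoped form ("googlebot: noindex"), and treats "none" as noindex plus nofollow.

```python
# Directives that carry a value after a colon; a prefix matching one of
# these is a directive name, not a user-agent name. (Assumed list.)
VALUED_DIRECTIVES = {
    "unavailable_after",
    "max-snippet",
    "max-image-preview",
    "max-video-preview",
}


def parse_x_robots_tag(header_values, user_agent="nutch"):
    """Collect the directives that apply to `user_agent`.

    `header_values` is a list of raw X-Robots-Tag header values, one
    entry per header line in the response.
    """
    directives = set()
    for value in header_values:
        prefix, sep, rest = value.partition(":")
        prefix_name = prefix.strip().lower()
        # "agent: directives" form: a single token before the first colon
        # that is not itself a valued directive.
        if sep and "," not in prefix and prefix_name not in VALUED_DIRECTIVES:
            if prefix_name != user_agent.lower():
                continue  # addressed to some other crawler
            value = rest
        for token in value.split(","):
            # Keep only the directive name, dropping any ": value" part.
            name = token.partition(":")[0].strip().lower()
            if name:
                directives.add(name)
    if "none" in directives:
        directives.update({"noindex", "nofollow"})
    return directives


def may_index(directives):
    return "noindex" not in directives


def may_follow(directives):
    return "nofollow" not in directives
```

A fetcher or indexer could call parse_x_robots_tag on the response headers of any document (PDF, image, and so on) and skip indexing when may_index returns False, the same way the robots meta tag is honored for HTML.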



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
