[jira] [Updated] (NUTCH-1257) Support for the x-robots-tag HTTP Header

2013-04-26 Thread Sebastian Nagel (JIRA)

 [ https://issues.apache.org/jira/browse/NUTCH-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1257:
---

Fix Version/s: 1.8

 Support for the x-robots-tag HTTP Header
 

 Key: NUTCH-1257
 URL: https://issues.apache.org/jira/browse/NUTCH-1257
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Mike
Assignee: Markus Jelsma
  Labels: http, privacy, robots
 Fix For: 2.3, 1.8


 Google and Bing both currently support the x-robots-tag HTTP header. This is 
 important because their policy is not to *crawl* URLs that are disallowed in 
 robots.txt, and not to *index* URLs that are marked noindex. When a page is 
 indexed but not crawled, Google and Bing will still show it in their results, 
 but without a snippet (since they didn't crawl it, they can't generate one). 
 As a result, the only way to keep a page out of the Google and Bing indexes 
 is to use the robots meta tag in HTML pages and the x-robots-tag header for 
 other MIME types.
 As a site owner who needs to keep specific pages private, I *cannot* trust 
 robots.txt to keep my pages out of Google and Bing, so I have to use both 
 of these robots standards. Since Nutch doesn't support the HTTP header, I 
 have to block it from crawling ALL non-HTML pages on my site.
 This is not an ideal state of affairs, and it would be great if Nutch 
 supported the x-robots-tag HTTP header.
 I've done more research on this topic on my blog:
  - http://michaeljaylissner.com/blog/support-for-x-robots-tag-http-header-and-robots-HTML-meta-tag
  - http://michaeljaylissner.com/blog/respecting-privacy-while-providing-hundreds-of-thousands-of-public-documents
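
 To make the request concrete, here is a rough sketch of how a crawler might 
 interpret x-robots-tag values for a non-HTML response. It is not Nutch code 
 and does not use any Nutch API: the class XRobotsTagSketch, the Directives 
 holder and the parse method are hypothetical names chosen for illustration. 
 It assumes the header values have already been read from the HTTP response, 
 handles only the noindex, nofollow and none directives, and treats a leading 
 "name:" prefix as a user-agent restriction, in the style that Google 
 documents for this header.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not part of Nutch: interprets X-Robots-Tag header values.
class XRobotsTagSketch {

    /** What the header values allow: indexing the page, following its links. */
    static final class Directives {
        boolean index = true;
        boolean follow = true;
    }

    /**
     * Parses all X-Robots-Tag values of one response for the given crawler name.
     * Only noindex, nofollow and none are recognized; everything else is ignored.
     */
    static Directives parse(List<String> headerValues, String agentName) {
        Directives result = new Directives();
        String agent = agentName.toLowerCase(Locale.ROOT);
        for (String raw : headerValues) {
            String value = raw.trim().toLowerCase(Locale.ROOT);
            int colon = value.indexOf(':');
            if (colon >= 0) {
                String prefix = value.substring(0, colon).trim();
                // Assumption: a prefix without commas that is not the
                // unavailable_after directive names a specific user agent.
                boolean looksLikeAgent =
                    !prefix.contains(",") && !prefix.equals("unavailable_after");
                if (looksLikeAgent) {
                    if (!prefix.equals(agent)) {
                        continue; // addressed to a different crawler
                    }
                    value = value.substring(colon + 1);
                }
            }
            for (String token : value.split(",")) {
                switch (token.trim()) {
                    case "noindex":
                        result.index = false;
                        break;
                    case "nofollow":
                        result.follow = false;
                        break;
                    case "none": // shorthand for noindex, nofollow
                        result.index = false;
                        result.follow = false;
                        break;
                    default:
                        break; // unknown or unsupported directive
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Directives d = parse(Arrays.asList("noindex"), "nutch");
        System.out.println("index=" + d.index + " follow=" + d.follow);
    }
}
```

 With something along these lines, a PDF served with "x-robots-tag: noindex" 
 could be fetched but kept out of the index, instead of being blocked 
 wholesale in robots.txt.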

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1257) Support for the x-robots-tag HTTP Header

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ https://issues.apache.org/jira/browse/NUTCH-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1257:


Fix Version/s: 2.2, 1.7

 Support for the x-robots-tag HTTP Header
 

 Key: NUTCH-1257
 URL: https://issues.apache.org/jira/browse/NUTCH-1257
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Mike
  Labels: http, privacy, robots
 Fix For: 1.7, 2.2


