charlie w wrote:
I have a question about the proper interpretation of a noindex robots directive in a meta tag (<meta name="robots" content="noindex" />).
I couldn't find any unambiguous description of this tag in the official documents (robotstxt.org or HTML 4.01). Should a crawler completely skip such a page, including its URL, i.e. pretend that the page doesn't exist? Or should it skip the content of the page but still recognize that the page exists?
Nutch does the latter, i.e. it skips the content of the page but still adds a page (without content) to the index.
When Nutch fetches such a page, the content, title, etc. of the page are not indexed, but the URL itself is. The document is searchable by terms in the URL. That is, if the URL of the page is http://www.mysite.com/onepage.html, the page is returned as a hit when searching for "onepage". Is it correct that Nutch does not index the content but still creates a Lucene document for a page with such a directive? Intuitively it seems to me that it should not be searchable at all.
Your intuition may be right, my intuition may be right too .. ;) If you find an official specification that unambiguously defines the expected behavior, we'll comply.
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
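
The two readings of noindex discussed in this thread can be sketched in a few lines. This is only an illustration in Python, not Nutch code: the names RobotsMetaParser, robots_directives, build_index_entry and the keep_url_when_noindex flag are hypothetical, and the dictionary merely stands in for a Lucene document. With the flag set, the sketch mirrors the behavior described above (URL kept, content dropped); with it cleared, the page is not indexed at all.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of lower-cased robots meta directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def build_index_entry(url, title, body, html, keep_url_when_noindex):
    """Return a dict standing in for an index document, or None to skip the page.

    keep_url_when_noindex is a hypothetical switch between the two readings:
    True  -> keep a URL-only entry (the behavior described in this thread)
    False -> treat the page as if it did not exist
    """
    if "noindex" not in robots_directives(html):
        return {"url": url, "title": title, "content": body}
    if keep_url_when_noindex:
        return {"url": url}
    return None
```

Under the first reading, a search for "onepage" can still match the URL-only entry; under the second, nothing is stored, so the page cannot be returned as a hit.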
