Heck, I'm not committed to my intuition; it gets me in trouble all the time ;-)
I was just curious as to whether this behavior was by design. This whole
robots thing is pretty un-spec'd as it is. Apparently the big search
engines don't agree on this either:

http://www.mattcutts.com/blog/handling-noindex-meta-tags/

In my particular case, I want to pretend the page doesn't exist at all.
Since I already have my own parse and indexing plugins, it was relatively
trivial to cause my crawler to behave the way I want.

Thanks
Charlie

On 12/19/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> charlie w wrote:
> > I have a question about the proper interpretation of a noindex robots
> > directive in a meta tag (<meta name="robots" content="noindex" />).
>
> I couldn't find any unambiguous description of this tag in the official
> documents (robotstxt.org or HTML 4.01). Should a crawler completely skip
> such a page, including its URL, i.e. pretend such a page doesn't
> exist? Or should it skip the content of the page but still recognize
> that such a page exists?
>
> Nutch does the latter, i.e. it skips the content of the page but still
> adds a page (without content) to the index.
>
> > When Nutch fetches such a page, the content, title, etc. of the page
> > is not indexed, but the URL itself is. The document is searchable by
> > terms in the URL. That is, if the URL of the page is
> > http://www.mysite.com/onepage.html, the page is returned as a hit
> > when searching "onepage".
> >
> > Is it correct that Nutch does not index the content but still creates
> > a Lucene document for a page with such a directive? Intuitively it
> > seems to me as if it should not be searchable at all.
>
> Your intuition may be right, my intuition may be right too .. ;) If you
> find an official specification that unambiguously defines the expected
> behavior, we'll comply.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
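
For illustration only, the two interpretations discussed in this thread
can be sketched in a few lines of Python. This is a standalone sketch and
not Nutch code: RobotsMetaParser, index_decision, and the skip_entirely
flag are hypothetical names invented here, and the three return values
are just labels for "index normally", "pretend the page doesn't exist",
and "keep a URL-only document".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here
        # by HTMLParser's default handle_startendtag.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def index_decision(html, skip_entirely):
    """Return "index", "skip", or "url-only" for a fetched page.

    skip_entirely=True  models "pretend the page doesn't exist".
    skip_entirely=False models the behavior described for Nutch above:
    drop the content but keep a document for the URL.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    if "noindex" not in parser.directives:
        return "index"
    return "skip" if skip_entirely else "url-only"


page = '<html><head><meta name="robots" content="noindex" /></head></html>'
print(index_decision(page, skip_entirely=True))
print(index_decision(page, skip_entirely=False))
```

Making the choice an explicit flag reflects the point of the thread: the
directive itself does not say which behavior is intended, so the crawler
operator has to pick one.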
