Heck, I'm not committed to my intuition; it gets me in trouble all the time ;-)

I was just curious as to whether this behavior was by design.  This
whole robots thing is pretty un-spec'd as it is.  Apparently the big
search engines don't agree on this either:
http://www.mattcutts.com/blog/handling-noindex-meta-tags/

In my particular case, I want to pretend the page doesn't exist at
all.  Since I already have my own parse and indexing plugins, it was
relatively trivial to cause my crawler to behave the way I want.
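
For anyone who wants the same behavior, here is a rough sketch of the
rule I mean. It is plain Python, not the Nutch plugin API, and every
name in it (RobotsMetaParser, robots_directives, should_index) is made
up for illustration. It assumes the policy "a noindex page produces no
index document at all", which is my preference and not something any
spec mandates.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots meta directives found in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def should_index(html):
    """Hypothetical policy: skip the page entirely if it says noindex.

    "none" is treated as shorthand for "noindex, nofollow".
    """
    return not ({"noindex", "none"} & robots_directives(html))
```

In an indexing step, a page for which should_index() is false would
simply be dropped, so that no document (not even a URL-only one) is
created for it.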

Thanks
Charlie

On 12/19/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> charlie w wrote:
> > I have a question about the proper interpretation of a noindex robots
> > directive in a meta tag (<meta name="robots" content="noindex" />).
>
> I couldn't find any unambiguous description of this tag in the official
> documents (robotstxt.org or HTML 4.01). Should a crawler completely skip
> such a page, including its URL, i.e. pretend that such a page doesn't
> exist? Or should it skip the content of the page but still recognize
> that such a page exists?
>
> Nutch does the latter, i.e. it skips the content of the page but still
> adds a page (without content) to the index.
>
> >
> > When Nutch fetches such a page, the content, title, etc. of the page
> > is not indexed, but the URL itself is.  The document is searchable by
> > terms in the URL.  That is, if the URL of the page is
> > http://www.mysite.com/onepage.html, the page is returned as a hit
> > when searching "onepage".
> >
> > Is it correct that Nutch does not index the content but still creates
> > a Lucene document for a page with such a directive?  Intuitively it
> > seems to me as if it should not be searchable at all.
>
> Your intuition may be right, my intuition may be right too .. ;) If you
> find an official specification that unambiguously defines the expected
> behavior, we'll comply.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
