Hi Charlie,

IMO, if the maintainer doesn't want a page to be searchable at all,
the page should be excluded using robots.txt (my intuition).
Unfortunately, I cannot tell you how Nutch finally handles such a page
in its index.
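
To illustrate the robots.txt route, here is a minimal sketch using Python's
standard urllib.robotparser. The robots.txt content, the "Nutch" agent string
and the is_allowed helper are my own assumptions for illustration; this only
shows how a compliant crawler would read such a rule, not what Nutch does
internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for www.mysite.com (an assumption, not a real file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /onepage.html
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines rather than one string.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # The agent name "Nutch" is only an example value.
    for page in ("onepage.html", "other.html"):
        url = "http://www.mysite.com/" + page
        print(url, is_allowed(ROBOTS_TXT, "Nutch", url))
```

A crawler that honours such a rule never fetches the page, so it never
sees the page's meta tags at all. A noindex meta tag, by contrast, is only
seen after the page has been fetched.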


My two cents,

Martin

On Dec 19, 2007 1:04 AM, charlie w <[EMAIL PROTECTED]> wrote:
> I have a question about the proper interpretation of a noindex robots
> directive in a meta tag (<meta name="robots" content="noindex" />).
>
> When Nutch fetches such a page, the content, title, etc. of the page
> is not indexed, but the URL itself is.  The document is searchable by
> terms in the URL.  That is, if the URL of the page is
> http://www.mysite.com/onepage.html, the page is returned as a hit
> when searching "onepage".
>
> Is it correct that Nutch does not index the content but still creates
> a Lucene document for a page with such a directive?  Intuitively it
> seems to me as if it should not be searchable at all.
>
> Thanks,
> Charlie
>
