Hi Charlie,

IMO, if the maintainer doesn't want a page to be searchable at all, the page should be excluded using robots.txt (my intuition). Unfortunately, I cannot tell you how Nutch ultimately handles such a page in its index.
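
For illustration, here is a minimal sketch of what the robots.txt exclusion could look like, checked with Python's standard urllib.robotparser rather than with Nutch. The Disallow rule, the "Nutch" user-agent string and the helper name is_fetch_allowed are assumptions made up for this example; it only shows how a compliant crawler would read such a rule and says nothing about how Nutch itself treats the page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for www.mysite.com (an assumption for this sketch):
# every crawler is asked not to fetch /onepage.html.
ROBOTS_TXT = """\
User-agent: *
Disallow: /onepage.html
"""


def is_fetch_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "Nutch" is only a placeholder agent name here.
    for page in ("onepage.html", "other.html"):
        url = "http://www.mysite.com/" + page
        print(url, is_fetch_allowed(ROBOTS_TXT, "Nutch", url))
```

A crawler that honors the rule never fetches the excluded page, so it never gets as far as reading a robots meta tag inside it.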

My two cents,
Martin

On Dec 19, 2007 1:04 AM, charlie w <[EMAIL PROTECTED]> wrote:
> I have a question about the proper interpretation of a noindex robots
> directive in a meta tag (<meta name="robots" content="noindex" />).
>
> When Nutch fetches such a page, the content, title, etc. of the page
> are not indexed, but the URL itself is. The document is searchable by
> terms in the URL. That is, if the URL of the page is
> http://www.mysite.com/onepage.html, the page is returned as a hit
> when searching for "onepage".
>
> Is it correct that Nutch does not index the content but still creates
> a Lucene document for a page with such a directive? Intuitively, it
> seems to me that it should not be searchable at all.
>
> Thanks,
> Charlie
