Re: NOINDEX, NOFOLLOW

Kirby Bohling Thu, 10 Dec 2009 11:34:02 -0800

On Thu, Dec 10, 2009 at 12:22 PM, BELLINI ADAM <[email protected]> wrote:
>
> Hi,
>
> I have a page with <meta name="robots" content="noindex,nofollow" />. I
> know that Nutch obeys this tag because I don't find the content and the
> title in my index, but I expected that the document would not be present
> in the index at all. Why does it keep the document in my index with no
> title and no content?
>
> I'm using the index-basic and index-more plugins, and I want to understand
> why Nutch still fills in the url, date, boost, etc. when it didn't do so
> for title and content.
>
> I was thinking that if Nutch obeys nofollow and noindex, it would
> skip the whole document!
>
> Or maybe I misunderstood something; can you please explain this behavior
> to me?
>
> Best regards.
>


My guess is that the page is recorded to note that it shouldn't
be fetched; I'm guessing the status is one of the magic values.  It
probably re-fetches the page periodically to ensure it has the list.
So it makes sense to me that the URL and the date are populated.
I don't know why it is computing the boost, other than the fact that
it might be part of the OPIC scoring algorithm.  If the scoring
algorithm ever uses the scores/boost of the pages that you point at as
a contributing factor, it would make total sense.  So even though it
doesn't index "http://example/foo/bar", knowing which pages point
there, and what their scores are, could contribute to the scores of
pages that you do index and that contain an outlink to that page.
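
To make that guess a bit more concrete, here is a toy sketch of one
OPIC-style pass. It is not Nutch's code: the function name, the
three-page link graph, and the rule that each page splits its "cash"
equally among its outlinks are all assumptions for illustration. The
only point is that a page left out of the index can still hold a
score in the link graph.

```python
def opic_pass(outlinks, cash):
    """One simplified OPIC-style pass (illustrative, not Nutch's code).

    Each page hands its cash, split equally, to the pages it links to.
    A page with no followed outlinks simply keeps its cash.
    """
    new_cash = {page: 0.0 for page in cash}
    for page, amount in cash.items():
        targets = outlinks.get(page, [])
        if not targets:
            new_cash[page] += amount
            continue
        share = amount / len(targets)
        for target in targets:
            new_cash[target] = new_cash.get(target, 0.0) + share
    return new_cash


# Hypothetical site; /foo/bar carries noindex,nofollow.
outlinks = {
    "http://example/a": ["http://example/foo/bar", "http://example/b"],
    "http://example/b": ["http://example/foo/bar"],
    "http://example/foo/bar": [],  # nofollow: its links are not followed
}
noindex = {"http://example/foo/bar"}
cash = {url: 1.0 for url in outlinks}

scores = opic_pass(outlinks, cash)

# The crawler-side record keeps a score for every known URL ...
for url, score in sorted(scores.items()):
    print(url, score)

# ... while only pages without noindex would be handed to the indexer.
indexed = {url: s for url, s in scores.items() if url not in noindex}
```

In this sketch the noindex page ends up with the largest score because
both other pages link to it, yet it never appears in `indexed`.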

Kirby
