On Thu, Dec 10, 2009 at 12:22 PM, BELLINI ADAM <[email protected]> wrote:
>
> hi,
>
> I have a page with <meta name="robots" content="noindex,nofollow" />. I
> know that Nutch obeys this tag because I don't find the content or the
> title in my index, but I expected that this document would not be present
> in the index at all. Why does it keep the document in my index with no
> title and no content?
>
> I'm using the index-basic and index-more plugins, and I want to understand
> why Nutch still fills in the url, date, boost, etc. when it didn't do so for
> title and content.
>
> I was thinking that if Nutch obeys nofollow and noindex, it would
> skip the whole document!
>
> Or maybe I misunderstood something. Can you please explain this behavior to me?
>
> best regards.
>

My guess is that the page is recorded to note that it shouldn't be
indexed; I'm guessing the status is one of the magic values. Nutch
probably re-fetches the page periodically to keep that record up to date,
so it makes sense to me that the URL and the date are populated.
I don't know why it is computing the boost, other than the fact that
it might be part of the OPIC scoring algorithm. If the scoring
algorithm ever uses the scores/boost of the pages that you point at as
a contributing factor, it would make total sense. So even though it
doesn't index "http://example/foo/bar", knowing which pages point
there, and what their scores are, could contribute to the scores of pages
that you do index and that contain an outlink to that page.
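
To make the guess above concrete, here is a toy sketch of why a
link-based scorer might still carry a score for a page it never indexes.
It is a simplified take on the OPIC idea (each page splits its score
equally among its outlinks) and is not Nutch's actual code; the Page
class, its fields, and both function names are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Toy crawl record; field names are illustrative, not Nutch's schema."""
    url: str
    outlinks: list = field(default_factory=list)
    noindex: bool = False
    score: float = 0.0


def distribute_scores(pages, initial=1.0):
    """One OPIC-style pass: each page splits its cash equally among its outlinks."""
    cash = {url: initial for url in pages}
    received = {url: 0.0 for url in pages}
    for url, page in pages.items():
        # Only links to pages we know about take part in this toy pass.
        targets = [t for t in page.outlinks if t in pages]
        if not targets:
            continue
        share = cash[url] / len(targets)
        for target in targets:
            received[target] += share
    for url, page in pages.items():
        page.score = received[url]
    return pages


def indexable(pages):
    """Pages without noindex are the only ones that would get title and content."""
    return [p.url for p in pages.values() if not p.noindex]


pages = {
    "http://example/a": Page("http://example/a", ["http://example/foo/bar"]),
    "http://example/b": Page(
        "http://example/b", ["http://example/foo/bar", "http://example/a"]
    ),
    # noindex,nofollow: no outlinks are followed, and it is left out of the index.
    "http://example/foo/bar": Page("http://example/foo/bar", noindex=True),
}
distribute_scores(pages)
```

In this toy run the noindex page still ends up with a score, because
the two other pages link to it, while indexable(pages) returns only the
other two URLs. Whether Nutch then uses that score the way I described
is still a guess on my part.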

Kirby
