On 2009-12-10 20:33, Kirby Bohling wrote:
On Thu, Dec 10, 2009 at 12:22 PM, BELLINI ADAM<[email protected]>  wrote:

Hi,

I have a page with <meta name="robots" content="noindex,nofollow" />. I know
that Nutch obeys this tag because I don't find the content and the title in my index, but I
expected that the document would not be present in the index at all. Why does it keep the document in my index with
no title and no content?

I'm using the index-basic and index-more plugins, and I want to understand why
Nutch still fills in the url, date, boost, etc., since it didn't do so for title and
content.

I thought that if Nutch obeys nofollow and noindex, it would skip
the whole document!

Or maybe I misunderstood something. Can you please explain this behavior to me?

Best regards.


My guess is that the page is recorded to note that the page shouldn't
be fetched, I'm guessing the status is one of the magic values.  It
probably re-fetches the page periodically to ensure it has the list.
So the URL and the date make sense to me as to why they populate them.
  I don't know why it is computing the boost, other than the fact that
it might be part of the OPIC scoring algorithm.  If the scoring
algorithm ever uses the scores/boost of the pages that you point at as
a contributing factor, it would make total sense.  So even though it
doesn't index "http://example/foo/bar", knowing which pages point
there, and what their scores are, could contribute to the scores of pages
that you do index and that contain an outlink to that page.
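
To make the scoring argument above concrete, here is a toy sketch of a single link-based score pass in which every page splits its current score equally among its outlinks. It is only an illustration of the general idea, not Nutch's OPIC implementation; the function name, page names, and starting scores are all made up. The point is that the "noindex" page still needs a record, because score flows to it even though its content is never indexed.

```python
def distribute_scores(scores, outlinks):
    """One pass: each page splits its current score equally among its outlinks.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    received = {url: 0.0 for url in scores}
    for page, links in outlinks.items():
        if not links:
            continue  # a page without outlinks passes nothing on
        share = scores[page] / len(links)
        for target in links:
            received[target] = received.get(target, 0.0) + share
    return received


# Assumed toy link graph: "noindex" stands for the page carrying the robots meta tag.
scores = {"a": 1.0, "b": 1.0, "noindex": 1.0}
outlinks = {"a": ["b", "noindex"], "b": ["noindex"], "noindex": []}
received = distribute_scores(scores, outlinks)
```

If the "noindex" entry were dropped entirely, the score that "a" and "b" send to it would have nowhere to be recorded.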

Very good explanation; those are exactly the reasons why Nutch never discards such pages. If you really want to ignore certain pages, then use URLFilters and/or ScoringFilters.
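
As a rough illustration of the URL-filtering approach, the sketch below imitates the rule style used by Nutch's regex URL filter, where each rule is a regular expression prefixed with "+" (accept) or "-" (reject) and the first matching rule decides. It is a standalone Python sketch rather than Nutch code; the rule list, the example URLs, and the accept() helper are assumptions made for illustration.

```python
import re

# Hypothetical rules, checked in order; the first pattern that matches wins.
RULES = [
    ("-", re.compile(r"^http://example/foo/")),  # reject everything under /foo/
    ("+", re.compile(r".")),                     # accept anything else
]


def accept(url, rules=RULES):
    """Return True if the URL passes the filter, False if it is rejected."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected


blocked = accept("http://example/foo/bar")
allowed = accept("http://example/index.html")
```

A URL rejected this way is never recorded by the crawler at all, which is different from a noindex page that is fetched and then kept as a bare record.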


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
