On Thu, Dec 10, 2009 at 12:22 PM, BELLINI ADAM <[email protected]> wrote:
>
> hi,
>
> i have a page with <meta name="robots" content="noindex,nofollow" />, now i
> know that nutch obey to this tag because i dont find the content and the
> title in my index, but i was wondering that this document will not be present
> in the index. why he keep the document in my index with no title and no
> content ??
>
> i'm using index-basic and index-more plugins, and i want to understand why
> nutch still filling the url, date, boost....etc since he didnt it for title
> and content.
>
> i was thinking that if nutch will obey to nofollow and noindex so it will
> skip all the document !
>
> or mabe i missunderstood something, can you plz explain this behavior to me?
>
> best regards.
>

My guess is that the page is recorded so that Nutch has a note that it
shouldn't be indexed; I'm guessing its status is one of the magic values.
It probably re-fetches the page periodically to keep that record up to
date. So it makes sense to me that the URL and the date are populated.

I don't know why it is computing the boost, other than that it might be
part of the OPIC scoring algorithm. If the scoring algorithm ever uses the
scores/boosts of the pages that you point at as a contributing factor, it
would make total sense. Even though it doesn't index
"http://example/foo/bar", knowing which pages point there, and what their
scores are, could contribute to the scores of pages that you do index and
that contain an outlink to that page.

Kirby
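
To make the scoring guess concrete, here is a rough sketch of the kind of
rule I have in mind. It is not Nutch's code and I am not claiming OPIC works
this way: the function name, the 0.5 weight and the example scores are all
made up for illustration. It only shows why a score would still be worth
keeping for a page that never reaches the index.

```python
# Hypothetical scoring rule, NOT taken from Nutch: a page's final score is
# its own score plus a fraction of the mean score of the pages it links to.

def score_with_outlinks(base, outlinks, weight=0.5):
    """Return {url: final score} under the assumed rule above."""
    result = {}
    for url, own in base.items():
        # Only targets we kept a score for can contribute.
        targets = [t for t in outlinks.get(url, []) if t in base]
        bonus = sum(base[t] for t in targets) / len(targets) if targets else 0.0
        result[url] = own + weight * bonus
    return result


# Made-up scores; "http://example/foo/bar" is the noindex page.
base = {"http://example/index": 1.0, "http://example/foo/bar": 2.0}
outlinks = {"http://example/index": ["http://example/foo/bar"]}
indexable = {"http://example/index"}

final = score_with_outlinks(base, outlinks)

# Only indexable pages get a boost written to the index, but the noindex
# page's score has already fed into the page that links to it.
index_boosts = {url: s for url, s in final.items() if url in indexable}
print(index_boosts)
```

If the record for the noindex page were dropped entirely, the linking page
would fall back to its own score in this sketch, which is the effect I was
guessing at.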
