Hi, folks, I have just started digging into relevance issues with Nutch, and I'm running into some mysteries. Before I dig too deep, I wanted to check whether these are known issues (a quick search of the email archives and of JIRA didn't turn up anything). I'm running Nutch 0.8 with a handful of patches.

I'm frequently finding the root pages of sites missing from my index, even though they have been fetched. In my admittedly short investigation I have found two classes of cases:

1. The root URL is not a redirect, but there is a root-level index.html page. The index.html page is in the index, but the root page is not. Unfortunately, most of the anchor text points to the root page, not to /index.html, and that anchor text has gone "missing" along with its page, so relevance is poor.

2. The root URL is a redirect to another page. Again, the other page is in the index, but the root page, along with its anchor text, has gone "missing."

I have a deduped index. Both cases could result from dedup throwing out the wrong URL, i.e. the one with more anchor text, although one might expect dedup to merge the two sets of anchor text (at least for pages that commonly normalize to the same URL, e.g. / and /index.html). The second case might instead result from the root URL somehow being normalized to its redirect target, but in that case (which would be incorrect anyway) I would expect the anchor text to be attached to the redirect target as well, and it is not.

I'm about to rebuild with no deduping and see what I find.

Thanks for your help & comments-
Doug
