Hi Stefan,

Yes, you're right. The index built without deduping does not show the first class of problem (though of course it is full of duplicates, so it has other problems). It still shows the problem with missing redirects, though that could be something else; I will investigate it next.

A little digging has turned up more information:

1) Dedup throws away pages with matching content and decides which one to keep based on score. This leads it to dump the wrong page, because:

    http://www.x.com/            score: 1.2
    http://www.x.com/index.html  score: 1.8

I see two problems.

First, there is clearly a scoring problem (possibly my fault somehow; could this have resulted from my failing to build the index properly?). The root page actually has 9 inlinks; the index.html page has none. I can't see anything that would warrant index.html getting a higher score, even if these were actually different pages. It seems this could be related to the problems you've already discovered. One (perhaps just short-term?) possibility would be to use the inbound link count to decide which page becomes the "canonical" version of a duplicate set, since this is probably more stable than the scores.

Second, these are in fact the same page. Regardless of which page "wins" by score, dedup should actually merge the two entries, since this is a safe normalization given that we know the content fingerprints are the same. The anchor texts and the scores should be combined. We can't necessarily do this for the general dedup case -- a page shouldn't necessarily benefit just because there are multiple copies of it -- though even there we may be able to combine some anchor text. But in this case these are not multiple copies; they are the same page.

In any case, we should work hard not to lose anchor text unless it is completely justified (e.g. for spam). For relevance purposes, anchor text is more important than any other page feature, score included. And especially in our world of small, focused crawls, it is a precious, scarce resource.

Thoughts? Comments?

-Doug


Stefan Groschupf-2 wrote:
>
> Hi Doug,
> I'm pretty sure that your problem is related to the deduping of your
> index.
> In general the hash of the content of a page is used as the key for the
> dedup tool.
> We ran into the forwarding problem also in another case.
> https://issues.apache.org/jira/browse/NUTCH-353
> So maybe we should think about a general solution to the forwarding
> problem.
>
> Greetings,
> Stefan
>
>
> On 28.08.2006 at 11:33, Doug Cook wrote:
>
>>
>> Hi, folks,
>>
>> I have just started digging into relevance issues with Nutch, and I'm
>> running into some mysteries. Before I dig too deep, I wanted to check
>> to see if these were known issues (a quick search of the email archives
>> and of JIRA didn't turn up anything). I'm running 0.8 with a handful of
>> patches.
>>
>> I'm frequently finding root pages of sites missing from my index,
>> despite the fact that they have been fetched. In my admittedly short
>> investigation I have found two classes of cases:
>>
>> 1. Root URL is not a redirect, but there is a root-level index.html
>> page. The index.html page is in the index, but the root page is not.
>> Unfortunately, most of the anchor text points to the root page, not the
>> /index.html page, and the anchor text has gone "missing" along with its
>> associated page, so relevance is poor.
>>
>> 2. Root URL is a redirect to another page. Again, this other page is in
>> the index, but the root page, along with its anchor text, has gone
>> "missing."
>>
>> I have a deduped index. Both of these cases could result from dedup
>> throwing out the wrong URL, i.e. the one with more anchor text, although
>> one might expect dedup to merge the two anchor texts (at least in the
>> case of pages which commonly normalize to the same URL, e.g. / and
>> /index.html).
>>
>> The second case might result from the root URL somehow being normalized
>> to its redirect target, but in that case (incorrect, in any case) I
>> would expect the anchor text to also be attached to the redirect target,
>> and it is not.
>>
>> I'm about to rebuild with no deduping and see what I find.
>>
>> Thanks for your help & comments-
>>
>> Doug
>> --
>> View this message in context: http://www.nabble.com/Missing-pages---anchor-text-tf2179049.html#a6025652
>> Sent from the Nutch - Dev forum at Nabble.com.
>>
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 101tec Inc.
> Menlo Park, California
> http://www.101tec.com
>
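
To make the dedup proposal in this thread concrete, here is a minimal sketch in Python. It is not Nutch code and does not use any Nutch API: the Page class, the normalize() and dedup() functions, and the "abc" content hash are all hypothetical names made up for illustration. It assumes that pages are grouped by content fingerprint, that the canonical page is chosen by inlink count (score only breaks ties), and that entries are merged only when the two URLs normalize to the same page. Summing the scores is also an assumption; the thread only says the scores should be "combined".

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Page:
    """Hypothetical index entry for illustration; not a Nutch class."""
    url: str
    content_hash: str
    score: float
    inlinks: int
    anchors: List[str] = field(default_factory=list)


def normalize(url: str) -> str:
    """Map '.../index.html' to '.../' (an assumed, simplified normalization)."""
    if url.endswith("/index.html"):
        return url[: -len("index.html")]
    return url


def dedup(pages: List[Page]) -> List[Page]:
    # Group pages by content fingerprint.
    groups: Dict[str, List[Page]] = {}
    for page in pages:
        groups.setdefault(page.content_hash, []).append(page)

    kept = []
    for group in groups.values():
        # Pick the canonical page by inlink count; score only breaks ties.
        winner = max(group, key=lambda p: (p.inlinks, p.score))
        merged = Page(winner.url, winner.content_hash, winner.score,
                      winner.inlinks, list(winner.anchors))
        for page in group:
            if page is winner:
                continue
            if normalize(page.url) == normalize(winner.url):
                # Same page under two URLs: combine anchors, inlinks, scores.
                # Summing the scores is an assumption, not a settled rule.
                merged.anchors.extend(page.anchors)
                merged.inlinks += page.inlinks
                merged.score += page.score
            # Otherwise it is a true copy elsewhere: dropped without merging.
        kept.append(merged)
    return kept


pages = [
    Page("http://www.x.com/", "abc", 1.2, 9, ["X home", "X"]),
    Page("http://www.x.com/index.html", "abc", 1.8, 0, ["X index"]),
]
for page in dedup(pages):
    print(page.url, page.inlinks, page.anchors)
```

With the two example entries from the thread, the root URL wins because it has 9 inlinks, and the anchor text attached to /index.html is kept rather than discarded. A duplicate on a different host would still be dropped without contributing its score, which matches the concern that a page should not benefit just because copies of it exist.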
