Doug Cook wrote:
I'm thinking I should file issues on the following:
1. The scoring bug. Not sure what to file here, since such things are hard to pin down. But defining an "inversion" as score(hostname/(index|default|home).(html|jsp|asp|cfm|etc)) > score(hostname) on a ~2.5Mdoc database, where I have about 8100 such pairs, 6558 were inversions and only 1585 were "okay." Is this likely to be correct behavior for OPIC scores? Is this a likely manifestation of a known bug? It doesn't seem correct, but then, it's early and I still need more coffee ;-) In any case, this causes the "wrong" versions of the pages to be selected most of the time during dedup, and I've lost >6500 of the most important, most anchor-text-rich pages in my index -- a significant relevance issue.
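For anyone who wants to reproduce the count, the "inversion" definition above could be checked roughly as follows. This is only a sketch: the class name, the map of URL to score, and the exact list of extensions (standing in for the "etc" above) are assumptions, not anything taken from Nutch itself.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: counts score inversions between
// a site's root URL and its index-style page.
final class InversionCount {
    // Matches e.g. http://host/index.html; the extension list is an assumption.
    private static final Pattern INDEX_PAGE = Pattern.compile(
        "^(https?://[^/]+)/(?:index|default|home)\\.(?:html?|jsp|asp|cfm)$");

    // Returns {inversions, okay} over all (index page, root) pairs present in the map.
    static int[] count(Map<String, Float> scores) {
        int inversions = 0;
        int okay = 0;
        for (Map.Entry<String, Float> e : scores.entrySet()) {
            Matcher m = INDEX_PAGE.matcher(e.getKey());
            if (!m.matches()) {
                continue; // not an index-style page
            }
            Float rootScore = scores.get(m.group(1) + "/");
            if (rootScore == null) {
                continue; // no matching root URL, so no pair to compare
            }
            if (e.getValue() > rootScore) {
                inversions++; // index page outscores the root
            } else {
                okay++;
            }
        }
        return new int[] {inversions, okay};
    }
}
```

The sketch assumes root URLs are stored with a trailing slash; a database that stores them differently would need the lookup key adjusted.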
The default scoring-opic is admittedly buggy (even if the original algorithm is suitable for page scoring, which is not obvious at all). However, the inversion problem that you see may stem from the way these sites are interlinked - perhaps there really are a lot of inlinks pointing to sub-pages instead of the roots of the sites?
Anyway, if you feel that shorter urls should get a higher score, then you can add a scoring filter to the chain, and in it boost the score based on the url length.
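As a rough illustration of the kind of boost I mean, here is a minimal sketch of the scoring arithmetic only. The class name, the method, and the 0.1 damping factor are all made up for this example; how it gets wired into a scoring filter plugin is left out, since that depends on the Nutch version you run.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch: damps a score by the length of the URL path.
final class UrlLengthBoost {
    // Assumed tuning constant; pick whatever suits your data.
    private static final float PENALTY_PER_CHAR = 0.1f;

    static float boost(String url, float score) {
        String path = URI.create(url).getPath();
        if (path == null) {
            path = "";
        }
        // Characters beyond the leading "/" count against the score,
        // so "http://host" and "http://host/" are left untouched.
        int extra = Math.max(0, path.length() - 1);
        return score / (1.0f + PENALTY_PER_CHAR * extra);
    }
}
```

With this shape, a root URL keeps its original score while host/index.html is pushed below it whenever the two start out close, which should undo most of the inversions you describe.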
2. When "duplicates" really refer to the same page (e.g. X/ vs. X/index.html), entries should be merged. Really, these are just after-the-fact normalizations, but they are a class of normalizations which can't be done without comparing page fingerprints, since they are not true for all web servers.
This should already happen when you run DeleteDuplicates (dedup). Dedup selects pages with the same fingerprint, and then retains only the newest version if the urls are the same, or the version with the shorter url if the urls are different.
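To make the selection rule concrete, it could be written down like this. This is a simplified sketch of the rule as described above, not the actual DeleteDuplicates code; the Doc class and its fields are invented for the example.

```java
// Hypothetical illustration of the dedup rule, not the Nutch implementation.
final class DedupChoice {
    // Minimal stand-in for an indexed page; the field names are assumptions.
    static final class Doc {
        final String url;
        final String fingerprint;
        final long fetchTime;

        Doc(String url, String fingerprint, long fetchTime) {
            this.url = url;
            this.fingerprint = fingerprint;
            this.fetchTime = fetchTime;
        }
    }

    // Given two pages with the same fingerprint, returns the one to keep.
    static Doc keep(Doc a, Doc b) {
        if (!a.fingerprint.equals(b.fingerprint)) {
            throw new IllegalArgumentException("not duplicates: fingerprints differ");
        }
        if (a.url.equals(b.url)) {
            // Same url: keep the newest version.
            return a.fetchTime >= b.fetchTime ? a : b;
        }
        // Different urls: keep the version with the shorter url.
        return a.url.length() <= b.url.length() ? a : b;
    }
}
```

Under this rule X/ wins over X/index.html whenever both carry the same fingerprint, regardless of fetch time.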
3. Redirects. The index keeps the redirect target, but marks the source as unfetched. This is unfortunate behavior, at least for the class of redirects where www.x.com redirects to www.x.com/y, which, like the above combination of issues, causes the root pages, and thus much of the important anchor text, to be dropped from the index. This seems related to, if not the same as, NUTCH-273 (https://issues.apache.org/jira/browse/NUTCH-273). I was simply planning to add these comments to that issue, unless someone hollers.
Yes, as I indicated in that issue, pages we are redirected from should be marked as GONE, and definitely should be marked as fetched. Please add your comments if any aspect of what you just said is still missing from that issue.
For all of the cases where we ignore/drop pages, we should think about what happens to the inbound anchor text. We should work very, very hard to keep all the anchor text we have; it's by far the most important page feature for relevance.
Agreed. This may not be so easy in some cases, due to the way Nutch works at the moment, but we should then discuss how to refactor it to support this.
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
