Re: Missing pages & anchor text

Andrzej Bialecki Thu, 31 Aug 2006 08:20:03 -0700

Doug Cook wrote:

I'm thinking I should file issues on the following-


1. The scoring bug. Not sure what to file here, since such things are hard
to pin down. But defining an "inversion" as
        score(hostname/(index|default|home).(html|jsp|asp|cfm|etc)) >
score(hostname)
on a ~2.5Mdoc database, where I have about 8100 such pairs, 6558 were
inversions and only 1585 were "okay." Is this likely to be correct behavior
for OPIC scores? Is this a likely manifestation of a known bug? It doesn't
seem correct, but then, it's early and I still need more coffee ;-) In any
case, this causes the "wrong" versions of the pages to be selected most of
the time during dedup, and I've lost >6500 of the most important, most
anchor-text-rich pages, in my index -- a significant relevance issue.
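
The inversion count described above could be reproduced with something like the following. This is only a rough sketch, not Nutch code: it assumes the scores have already been exported into a plain dict of URL to score, and the names INDEX_PATH and count_inversions are made up for illustration. The extension list stops at the four named in the message, since "etc" is not spelled out there.

```python
import re
from urllib.parse import urlsplit

# Index-style paths directly under the host, e.g. /index.html or /default.asp.
# Only the extensions named in the message are listed (hypothetical pattern).
INDEX_PATH = re.compile(r"^/(index|default|home)\.(html|jsp|asp|cfm)$", re.IGNORECASE)


def count_inversions(scores):
    """Return (inversions, okay) over all index-page/root-page pairs.

    `scores` maps URL -> score. A pair is an inversion when the index page
    scores strictly higher than the root of the same host.
    """
    inversions = okay = 0
    for url, score in scores.items():
        parts = urlsplit(url)
        if not INDEX_PATH.match(parts.path):
            continue
        root = f"{parts.scheme}://{parts.netloc}/"
        if root not in scores:
            continue  # no pair to compare against
        if score > scores[root]:
            inversions += 1
        else:
            okay += 1
    return inversions, okay
```

Pairs are only counted when both the root URL and the index-style URL are present, which matches the "such pairs" wording above.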

The default scoring-opic is admittedly buggy (even if the original algorithm is suitable for page scoring, which is not obvious at all). However, the inversion problem that you see may stem from the way these sites are interlinked - perhaps there really are a lot of inlinks pointing to sub-pages instead of roots of the sites?

Anyway, if you feel that shorter urls should get a higher score, then you can add a scoring filter to the chain, and in it boost the score based on the url length.
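
To illustrate the kind of adjustment meant here, the sketch below scales a score by the length of the URL path. It is not a Nutch scoring filter and does not use any Nutch API; the function name boost_for_url_length and the strength parameter are assumptions made up for this example, and the formula is just one plausible choice.

```python
from urllib.parse import urlsplit


def boost_for_url_length(score, url, strength=0.1):
    """Scale `score` down as the URL path grows longer (hypothetical formula).

    A root URL keeps its full score; every extra path character reduces
    the result, so shorter URLs end up ranked relatively higher.
    """
    path = urlsplit(url).path or "/"  # treat "http://host" like "http://host/"
    return score / (1.0 + strength * (len(path) - 1))
```

With the default strength, a root page keeps its score while hostname/index.html (an 11-character path) is halved, which would be enough to undo a mild inversion. How strong the effect should be is a tuning question.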

2. When "duplicates" really refer to the same page (e.g. X/ vs.
X/index.html), entries should be merged. Really, these are just
after-the-fact normalizations, but they are a class of normalizations which
can't be done without comparing page fingerprints, since they are not true
for all web servers.

This should already happen when you run DeleteDuplicates (dedup). Dedup selects pages with the same fingerprint, and then retains only the newest version if urls are the same, OR the version with the shorter url if urls are different.
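
The selection rule just described can be sketched roughly as follows. This is an illustration of the rule as stated in this message, not the actual DeleteDuplicates implementation; the Page record and the dedup function are hypothetical names, and fetch_time is assumed to be a simple increasing number.

```python
from collections import defaultdict
from typing import NamedTuple


class Page(NamedTuple):
    url: str
    fingerprint: str
    fetch_time: int  # assumed: larger means fetched more recently


def dedup(pages):
    """Keep one page per fingerprint, following the rule described above."""
    groups = defaultdict(list)
    for page in pages:
        groups[page.fingerprint].append(page)

    kept = []
    for group in groups.values():
        # Shortest URL wins; among entries with the same URL, the newest wins.
        kept.append(min(group, key=lambda p: (len(p.url), p.url, -p.fetch_time)))
    return kept
```

Under this rule X/ beats X/index.html whenever the two share a fingerprint, regardless of their scores.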

3. Redirects. The index keeps the redirect target, but marks the source as
unfetched. This is unfortunate behavior, at least for the class of redirects
where www.x.com redirects to www.x.com/y, which, like the above combination
of issues, causes the root pages, and thus much of the important anchor
text, to be dropped from the index. This seems related to, if not the same
as, NUTCH-273 (https://issues.apache.org/jira/browse/NUTCH-273). I was
simply planning to add these comments to that issue, unless someone hollers.

Yes, as I indicated in that issue, pages we are redirected from should be marked as GONE, and definitely should be marked as fetched. Please add your comments if any aspect of what you just said is still missing from that issue.

For all of the cases where we ignore/drop pages, we should think about what
happens to the inbound anchor text. We should work very very hard to keep
all the anchor text we have, it's by far the most important page feature for
relevance.

Agreed. This may not be so easy in some cases, due to the way Nutch works at the moment, but we should then discuss how to refactor it to support this.


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
