Hi, Andrzej.
Thanks for the quick response!
> Andrzej Bialecki wrote:
> Doug Cook wrote:
> > I'm thinking I should file issues on the following-
> >
> > 1. The scoring bug. Not sure what to file here, since such things are
> > hard to pin down. But defining an "inversion" as
> > score(hostname/(index|default|home).(html|jsp|asp|cfm|etc)) >
> > score(hostname)
> > on a ~2.5Mdoc database, where I have about 8100 such pairs, 6558 were
> > inversions and only 1585 were "okay." Is this likely to be correct
> > behavior for OPIC scores? Is this a likely manifestation of a known
> > bug? It doesn't seem correct, but then, it's early and I still need
> > more coffee ;-) In any case, this causes the "wrong" versions of the
> > pages to be selected most of the time during dedup, and I've lost more
> > than 6500 of the most important, most anchor-text-rich pages in my
> > index -- a significant relevance issue.
> >
>
> The default scoring-opic is admittedly buggy (even if the original
> algorithm is suitable for page scoring, which is not obvious at all).
> However, the inversion problem that you see may stem from the way these
> sites are interlinked - perhaps there really are a lot of inlinks
> pointing to sub-pages instead of to the roots of the sites?
I thought of that, but at least on a cursory examination, the root pages I
looked at had more inbound anchor text, which leads me to believe that they
have more inbound links (at least external ones). I'll investigate further
and let you know what I find.
> Anyway, if you feel that shorter urls should get a higher score, then
> you can add a scoring filter to the chain, and in it boost the score
> based on the url length.
I'm not sure that "shorter URLs" is necessarily the right way to do it.
Within a host, that probably works fairly well. But imagine a host X and its
mirror X' - one of these two will generally be the "canonical" form of the
hostname, and it may or may not be the one with the shorter name. The more
linked-to version is probably the right one. Though perhaps the way to solve
that is to think about it as a normalization problem, and build a fast
"mirror table" into the normalizer. And as we dedup, if we see a lot of
duplicate sets of the form:
  X/blahblah
  X'/blahblah
then we identify (X,X') as "mirror candidates" and put them in the mirror
table with a hypothesis for which is the canonical version. Then we never
have to dedup them again, and all of the anchor text issues are solved as
well.
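To make that concrete, here's a rough sketch of the kind of bookkeeping I have
in mind for the mirror table. To be clear, nothing below is a real Nutch API --
MirrorCandidateDetector, recordDuplicate(), and the rest are names I just made
up -- it's only meant to illustrate the idea:

import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: while dedup-ing, count how often two hosts serve
 * identical content at the same path, and promote frequent pairs to
 * "mirror candidates" so the URL normalizer can rewrite one host to the
 * other from then on. None of these class/method names exist in Nutch.
 */
public class MirrorCandidateDetector {

  /** How many identical (path, content) pairs before we call two hosts mirrors. */
  private static final int MIRROR_THRESHOLD = 50;

  /** Counts of duplicate pages seen per (hostA, hostB) pair. */
  private final Map<String, Integer> pairCounts = new HashMap<String, Integer>();

  /** Confirmed mirrors: non-canonical host -> canonical host. */
  private final Map<String, String> mirrorTable = new HashMap<String, String>();

  /** Called for each pair of URLs that dedup found to have the same fingerprint. */
  public void recordDuplicate(java.net.URL a, java.net.URL b, int inlinksA, int inlinksB) {
    // Only interesting when the paths match but the hosts differ (X/foo vs X'/foo).
    if (!a.getPath().equals(b.getPath()) || a.getHost().equals(b.getHost())) {
      return;
    }
    String key = a.getHost().compareTo(b.getHost()) < 0
        ? a.getHost() + "\t" + b.getHost()
        : b.getHost() + "\t" + a.getHost();
    int count = pairCounts.containsKey(key) ? pairCounts.get(key) + 1 : 1;
    pairCounts.put(key, count);

    if (count >= MIRROR_THRESHOLD) {
      // Simplified hypothesis: the more linked-to host is the canonical one.
      String canonical = inlinksA >= inlinksB ? a.getHost() : b.getHost();
      String other     = inlinksA >= inlinksB ? b.getHost() : a.getHost();
      mirrorTable.put(other, canonical);
    }
  }

  /** The normalizer would consult this: rewrite the host if it is a known mirror. */
  public String normalizeHost(String host) {
    String canonical = mirrorTable.get(host);
    return canonical != null ? canonical : host;
  }
}

The threshold and the "more linked-to host wins" rule are just placeholders;
a real decision would probably want to look at aggregate inlink counts across
the whole pair of hosts rather than a single duplicate pair.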
> > 2. When "duplicates" really refer to the same page (e.g. X/ vs.
> > X/index.html) , entries should be merged. Really, these are just
> > after-the-fact normalizations, but they are a class of normalizations
> which
> > can't be done without comparing page fingerprints, since they are not
> true
> > for all web servers.
> >
>
> This should already happen when you run DeleteDuplicates (dedup). Dedup
> selects pages with the same fingerprint, and then retains only the newest
> version if the urls are the same, OR the version with the shorter url if
> the urls are different.
I'm not sure I follow you. I thought that dedup used the score -- in which
case www.x.com/index.html will win out over www.x.com/ when one of the
aforementioned "score inversions" takes place. And I also thought that the
"losing" URL was simply dropped, thus effectively losing its anchor text.
What I meant when I said "merged" above was that the anchor text from the
"losing" version of the URL is effectively merged into that of the "winning"
URL when the two are found to be not just copies of the same document, but
actually the same document, so that no anchor text is lost. To give a
concrete example:
pre-dedup:
  http://www.x.com/
    12 inbound links & anchor text
  http://www.x.com/index.html
    3 inbound links & anchor text

post-dedup (ideally):
  http://www.x.com/
    15 inbound links & anchor text

post-dedup (currently):
  http://www.x.com/index.html
    3 inbound links & anchor text
Something more or less identical to this should also happen in the (fairly
common) case where the root page is a redirect to an "internal home page"
(as, for example, on http://www.diageo.com/; see the redirect discussion
below). We may also want to do something like this for site mirrors --
handling mirrors as normalizations would take care of it automatically.
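To spell out the "merge" I'm talking about, here's a toy sketch. Page,
merge(), and everything else below are stand-ins I invented for illustration,
not actual Nutch classes:

import java.util.ArrayList;
import java.util.List;

/**
 * Toy illustration of merging anchor text when dedup decides two URLs
 * (e.g. http://www.x.com/ and http://www.x.com/index.html) are really
 * the same page. "Page" here is a stand-in, not a real Nutch class.
 */
public class AnchorMerge {

  static class Page {
    final String url;
    final List<String> anchors = new ArrayList<String>();
    Page(String url) { this.url = url; }
  }

  /**
   * Keep the URL with more inbound anchors (usually the root) as the
   * winner, and fold the loser's anchors into it instead of dropping them.
   */
  static Page merge(Page a, Page b) {
    Page winner = a.anchors.size() >= b.anchors.size() ? a : b;
    Page loser  = winner == a ? b : a;
    winner.anchors.addAll(loser.anchors);   // 12 + 3 = 15 in the example above
    return winner;
  }

  public static void main(String[] args) {
    Page root = new Page("http://www.x.com/");
    for (int i = 0; i < 12; i++) root.anchors.add("anchor-" + i);
    Page index = new Page("http://www.x.com/index.html");
    for (int i = 0; i < 3; i++) index.anchors.add("index-anchor-" + i);

    Page merged = merge(root, index);
    // Prints: http://www.x.com/ with 15 anchors
    System.out.println(merged.url + " with " + merged.anchors.size() + " anchors");
  }
}

The point is just that the loser's anchors get folded into the winner rather
than thrown away, so no anchor text is lost.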
Please pardon me if I'm misunderstanding -- I'm just going from the behavior
I see and the documentation & code comments; I haven't yet done a detailed
read-through of the code!
> > 3. Redirects. The index keeps the redirect target, but marks the source
> > as unfetched. This is unfortunate behavior, at least for the class of
> > redirects where www.x.com redirects to www.x.com/y, which, like the
> > above combination of issues, causes the root pages, and thus much of
> > the important anchor text, to be dropped from the index. This seems
> > related to, if not the same as, NUTCH-273
> > (https://issues.apache.org/jira/browse/NUTCH-273). I was simply
> > planning to add these comments to that issue, unless someone hollers.
> >
>
> Yes, as I indicated in that issue, pages we are redirected from should
> be marked as GONE, and definitely should be marked as fetched. Please
> add your comments if any aspect of what you just said is still missing
> from that issue.
A redirect origin should not necessarily be considered GONE. In many cases,
the redirect origin is the "canonical version" of the page, and the target
is the "transitory version," as with most internal root-page redirects (see
the Diageo example above). We should keep the origin URL as the page we index
in those cases. If a user searches for Diageo, they expect to get
www.diageo.com, not some long, complicated subpage URL.
Again, just to clarify what is happening here, I'm seeing something like:

In my crawl/link databases:
  http://www.diageo.com/
    30 anchor text strings
    Marked as UNFETCHED because it is a redirect
  http://www.diageo.com/en-row/homepage.htm
    3 anchor text strings
    Marked as FETCHED

In an IDEAL index:
  http://www.diageo.com/
    (and maybe http://www.diageo.com/en-row/homepage.htm indexed as an
    alias of this)
    33 inbound anchor text strings

In my CURRENT index:
  http://www.diageo.com/en-row/homepage.htm
    3 anchor text strings

This is obviously not optimal for relevance!
I like Doug C's -- oh shoot, too many Doug Cs around here! ;-) -- Doug
Cutting's idea (on NUTCH-273) that we should remember all of the redirects to
a given page at index time. We should also remember all of the
metadata/anchor text for those pages, and then make an intelligent decision
at index time about which anchor text to include, and even which URL(s) to
index the page under. That way we could arrive at the "ideal" index above.
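As a very rough sketch of that index-time step (again, all of the class and
method names below are invented for illustration; this is not the actual
Nutch indexer API):

import java.util.ArrayList;
import java.util.List;

/**
 * Invented sketch of collapsing a redirect chain at index time: index the
 * document once, under the URL we decide is canonical, keep the other URLs
 * as aliases, and pool the anchor text from all of them.
 */
public class RedirectCollapse {

  static class IndexedDoc {
    String canonicalUrl;
    final List<String> aliasUrls = new ArrayList<String>();
    final List<String> anchors = new ArrayList<String>();
  }

  /**
   * @param urls    all URLs that redirect (directly or transitively) to the
   *                fetched content, plus the final target itself
   * @param anchors anchor text collected for each URL, in the same order
   */
  static IndexedDoc collapse(List<String> urls, List<List<String>> anchors) {
    IndexedDoc doc = new IndexedDoc();
    // Placeholder heuristic for the sketch: call the shortest URL canonical
    // (http://www.diageo.com/ rather than .../en-row/homepage.htm).
    String canonical = urls.get(0);
    for (String u : urls) {
      if (u.length() < canonical.length()) canonical = u;
    }
    doc.canonicalUrl = canonical;
    for (int i = 0; i < urls.size(); i++) {
      if (!urls.get(i).equals(canonical)) doc.aliasUrls.add(urls.get(i));
      doc.anchors.addAll(anchors.get(i));   // 30 + 3 = 33 in the Diageo example
    }
    return doc;
  }
}

The "shortest URL wins" rule here is only a stand-in for whatever
canonical-URL choice we settle on; for in-host root redirects like Diageo's
it happens to pick the right one.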
I'm struggling to get the most out of the meager anchor text in my
relatively small index. Handling dups, mirrors, and redirects in a way that
lets us use all of the anchor text would be a significant relevance boost.
Thanks for listening to my rant -- and apologies again for any
misunderstandings on my part; I'm still getting up the (steep) Nutch
learning curve.
Doug