Here's a strange little problem that my brother, who operates the
www.muug.mb.ca web site, brought to my attention.  A search for "VNC"
on this site turns up five hits.  One of these includes the highlighted
word VNC as a link to the anchor

        http://www.muug.mb.ca/meetings/97-98.html#jun

as it should, but the hit for http://www.muug.mb.ca/meetings/98-99.html
should do likewise for VNC, which first occurs under the <a name=mar>
anchor.  However, it doesn't.

A bit of trial and error pointed the finger at htmerge.  When I run
htdig on just the meetings sub-directory, I get the following entries for
"vnc" in db.wordlist:

vnc     i:1     l:0     w:150000
vnc     i:0     l:420   w:580
vnc     i:1     l:787   w:1613  c:6     a:8
vnc     i:2     l:818   w:182   a:10

but after htmerge, these remain:

vnc     i:0     l:420   w:580
vnc     i:1     l:0     w:151613        c:7
vnc     i:2     l:818   w:182   a:10

The first vnc record is from the index page, for the description text
of the href to the 98-99.html page.  (Ironically, that href includes
the anchor #mar in it.)  When it's merged with the third vnc record,
the a:8 is lost.  The code in htmerge/words.cc that does this is:

            // OK... Now that we have our new WordRecord parsed
            // Do we (by horrible chance) duplicate the last entry?
            // If we do, update last_word and keep going
            if ((last_wr.id == wr.id)
                && (last_word == word))
              {
#ifndef NO_WORD_COUNT
                last_wr.count += wr.count;
#endif
                last_wr.weight += wr.weight;
                if (wr.location < last_wr.location)
                  last_wr.location = wr.location;
                if (wr.anchor < last_wr.anchor)
                  last_wr.anchor = wr.anchor;
                continue;
              }


I assume the comment above it was written before the anchor stuff
was put in.  Now, it's not a horrible chance, but a normal occurrence.
It seems to me, though, that it shouldn't simply take the minimum anchor.
A description-text record from the index page carries no anchor (which
appears to be stored as 0), so the minimum always lets it override
the anchor in the actual document.
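
To make the effect concrete, here is a minimal sketch of the merge step
applied to the two i:1 records above.  WordRec and merge_current are
hypothetical stand-ins, not the real htmerge types; the sketch also assumes
that a missing c: field means a count of 1 and a missing a: field means
anchor 0, which is what the merged output (c:7, no a:) suggests.

```cpp
// Hypothetical stand-in for htdig's WordRecord.  Field names follow the
// htmerge excerpt above; the real structure may differ.
struct WordRec {
    int id;        // document ID (i:)
    int location;  // location of the word (l:)
    int weight;    // accumulated weight (w:)
    int count;     // occurrence count (c:), assumed to default to 1
    int anchor;    // anchor number (a:), assumed to default to 0
};

// Mirrors the quoted merge logic: sum count and weight, keep the
// minimum location and the minimum anchor.
WordRec merge_current(WordRec last, const WordRec& wr) {
    last.count += wr.count;
    last.weight += wr.weight;
    if (wr.location < last.location)
        last.location = wr.location;
    if (wr.anchor < last.anchor)
        last.anchor = wr.anchor;   // anchor 0 always wins here
    return last;
}
```

Merging {1, 0, 150000, 1, 0} (the description-text record) with
{1, 787, 1613, 6, 8} (the record from the document itself) gives weight
151613, count 7, location 0 and anchor 0, in either order, which matches
the db.wordlist line after htmerge.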

Shouldn't the test be:

                if ((wr.anchor > 0 && wr.anchor < last_wr.anchor)
                    || last_wr.anchor == 0)
                  last_wr.anchor = wr.anchor;

The assumption here is that if two or more records have the same word
and ID, then one must come from the actual document with that ID,
and the others are from references to that document.  In this case,
a non-zero anchor number should override, as only the record from the
actual document can have a valid anchor number, and this has already
been determined by htdig to be the lowest number corresponding to the
first occurrence of the word in the document.
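
As a rough check of the proposed test, the rule can be isolated in a small
function.  merge_anchor is a hypothetical helper name, not something in
htmerge, and the sketch assumes that anchor 0 means "no anchor".

```cpp
// Proposed anchor merge, under the assumption that 0 means "no anchor".
// A non-zero anchor replaces the stored one if it is smaller, or if
// nothing valid has been stored yet.
int merge_anchor(int last_anchor, int new_anchor) {
    if ((new_anchor > 0 && new_anchor < last_anchor)
        || last_anchor == 0)
        return new_anchor;
    return last_anchor;
}
```

With this rule, merge_anchor(0, 8) and merge_anchor(8, 0) both yield 8,
so the result no longer depends on whether the description-text record
or the document's own record comes first.  Two non-zero anchors still
resolve to the smaller one.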

Is this assumption correct?  What about in the case of an update dig,
where for example the word VNC is added to 98-99.html before any anchors?
My understanding is that a modified document will be assigned a new
document ID, but what happens to all the description word records that
point to the old document ID?  Have I opened a can of worms here?

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930