Gilles,
Good point, but for Mich<sup>l</sup> I would add entries to the synonyms
database for Mich, Michael, and Michl, since they are all the same word.
For that example, there's no reason to want to search for each entry
individually. I would vote for treating <sup> and <sub> as hyphens, but
I can see where this might cause trouble (mathematical equations might be
one such case).
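
To make the options concrete, here is a rough Python sketch of the three
policies being discussed. It is not ht://Dig code: the function name
index_terms, the policy names, and the simple tag regex (which ignores tag
attributes) are all made up for illustration.

```python
import re

# Hypothetical helper, not ht://Dig code: matches bare <sup>/<sub>
# opening and closing tags, in any letter case, without attributes.
SUP_SUB = re.compile(r"</?su[pb]\s*>", re.IGNORECASE)


def index_terms(token, policy):
    """Return the lowercased index terms for one token under a policy.

    policy "break":  the tag acts as a word break.
    policy "join":   the tag is ignored and the pieces are run together.
    policy "hyphen": index the pieces and the joined form, the way
                     post-doctoral becomes post, doctoral, postdoctoral.
    """
    parts = [p.lower() for p in SUP_SUB.split(token) if p]
    joined = "".join(parts)
    if policy == "break":
        return parts
    if policy == "join":
        return [joined]
    if policy == "hyphen":
        # Only add the joined form when there was something to join.
        return parts if len(parts) < 2 else parts + [joined]
    raise ValueError("unknown policy: " + policy)


if __name__ == "__main__":
    for token in ("Cap<SUP>t</SUP>", "2<sup>nd</sup>", "x<sup>2</sup>"):
        print(token, index_terms(token, "hyphen"))
```

Under the "hyphen" policy a search for capt or 2nd would match, but the
last token also yields x2, which is the kind of trouble I'd expect with
equations.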
At 09:08 AM 4/11/02, Gilles Detillieux wrote:
>According to Greg Lepore:
> > It appears that HTDIG 3.1.6 handles superscripts by inserting a space for
> > the <SUP> opening tag, which causes the search to not find words which
> > contain the superscript. Example: Cap<SUP>t</SUP> will return as "Cap t"
> > so a search for capt will not work. Is HTDIG properly handling the <SUP>
> > tag? An example is at:
> >
>
>http://www.mdarchives.state.md.us/megafile/msa/speccol/sc2900/sc2908/000001/000011/html/am11--554.html
>
>When I added support for handling <sup> and <sub> tags, I had to decide
>whether they should cause a word break or not. The problem before was
>that they caused a word break in the words going into the word database,
>but not the excerpt, so matched words weren't highlighted in the excerpts
>if they were juxtaposed to a superscript or subscript. I set out to make
>it consistent, but had to decide which of the two behaviours to choose.
>
>In most of the uses of superscripts and subscripts I've seen, it makes
>more sense to treat them as causing a word break, so that's what I decided
>to do. The URL above is a good example of both uses of superscripts.
>When htdig sees Mich<sup>1</sup> or Nath<sup>1</sup>, it makes sense to
>have a word break at the <sup> tag, so that you can search for mich or
>nath instead of mich1 or nath1. However, for Capt<sup>n</sup>, you'd
>want a search for captn to find it. It occurs to me that this is a lot
>like the dilemma of valid_punctuation, where I fixed the code to take a
>word like post-doctoral and index it as post, doctoral and postdoctoral.
>Maybe <sup> and <sub> should be treated as a hyphen rather than a
>word break. That would be good for uses like 2<sup>nd</sup> too.
>Can anyone think of counterarguments to this?
>
>--
>Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
>Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
>Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
>Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
Gregory Lepore
Webmaster, State of Maryland
Supervisor, Archives of Maryland Online
410-260-6425
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
FAQ: http://htdig.sourceforge.net/FAQ.html