According to Neal Richter:
> > This error is happening in the DISPLAY of the excerpts... so it
> > seems like looking for &#XXX; patterns and NOT encoding them before
> > display is a reasonable strategy... the browser will decide how to
> > display it.

That would be a reasonable compromise, but note that it is a compromise.
For example, if an HTML document has something like "use &#153; in your
HTML to encode a ™ character", this will end up in db.excerpts as
"use ™ in your HTML to encode a ™ character". At that point, htsearch
has no way of knowing that the first occurrence was originally different
from the second. It comes down to a decision between encoding both or
leaving both as-is.

The other option would be for htdig to replace the & lead-in character
for undecoded entities with some other, non-ambiguous lead-in character
in the database, so that htsearch could always distinguish between the
two. But what character could we use that wouldn't conflict with
anything else?

> Attn Gilles:
>
> Display.cc
> Wed Mar 1 23:09:49 2000 UTC (3 years, 8 months ago) by grdetil
> http://cvs.sourceforge.net/viewcvs.py/htdig/htdig/htsearch/Display.cc?r1=1.100.2.14&r2=1.100.2.15&only_with_tag=htdig-3-2-x
> * htsearch/Display.cc (excerpt, hilight): move SGML encoding into
> hilight() function, because when it's done earlier it breaks
> highlighting of accented characters.
>
> OK, this is causing the problem.... if I reverse the changes after line
> 1284, it will not improperly encode
>
> ™ --> &#153;
>
> If we want to highlight accented characters, it seems like that
> <strong>&#XXX</strong> would do the trick. We don't necessarily need to
> convert SGML entities to single chars for the display highlighting to
> work...

Well, the highlighting itself won't care, but before htsearch highlights
a word, it has to find it in the excerpt. It does this using StringMatch.
If StringMatch is looking for words with accented letters, it's not going
to find them in the excerpt if they've already been SGML encoded there.

> Would you forward me an example of what this fix is supposed to do with
> an accented character? I can redesign this chunk of code to accomplish
> both goals.

E.g.
if you search for something like "réduction", then during the StringMatch
on the "head" string, it will find the unencoded word réduction, but if
we pre-encode the head string, it won't find the encoded word
r&eacute;duction. My fix was not to pre-encode the head string all at
once into SGML entities before highlighting, but rather to encode it
piece by piece during the highlighting, so that the whole text does
eventually get re-encoded. Of course, it needs to be done piece by piece
so that the HTML tags for the highlighting, etc., don't get their < and >
characters SGML encoded.

I fail to see how my fix causes your problem, though. Whether you SGML
encode the whole excerpt ("head" string) in one fell swoop before
highlighting, or bit by bit during, you're still going to SGML encode all
the ampersand characters one way or the other. Unless what you're
suggesting is to take out the SGML encoding altogether -- that would be a
mistake because then unencoded < and > characters in the excerpt would
not get properly encoded and could cause all sorts of problems.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
ht://Dig Developer mailing list: [EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev
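
For illustration only, here is a rough sketch of the piece-by-piece idea
described in the message above. It is not the actual Display.cc code:
sgml_encode() and highlight_excerpt() are hypothetical names, a plain
case-sensitive std::string::find stands in for StringMatch, and the
excerpt is assumed to be ISO-8859-1 with every byte above 0x7F written
as a numeric &#NNN; entity.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: SGML-encode one piece of plain text.
// Assumes ISO-8859-1 input; bytes above 0x7F become &#NNN;.
std::string sgml_encode(const std::string& text) {
    std::string out;
    for (unsigned char c : text) {
        switch (c) {
            case '&': out += "&amp;";  break;
            case '<': out += "&lt;";   break;
            case '>': out += "&gt;";   break;
            case '"': out += "&quot;"; break;
            default:
                if (c > 0x7F)
                    out += "&#" + std::to_string(static_cast<int>(c)) + ";";
                else
                    out += static_cast<char>(c);
        }
    }
    return out;
}

// Hypothetical stand-in for the excerpt highlighting step: search the
// RAW excerpt for the word, and encode each piece only as it is emitted,
// so the <strong> tags themselves are never encoded.
std::string highlight_excerpt(const std::string& head,
                              const std::string& word) {
    if (word.empty()) return sgml_encode(head);
    std::string out;
    std::size_t pos = 0;
    for (;;) {
        std::size_t hit = head.find(word, pos);
        if (hit == std::string::npos) break;
        out += sgml_encode(head.substr(pos, hit - pos));      // text before match
        out += "<strong>" + sgml_encode(word) + "</strong>";  // the match itself
        pos = hit + word.size();
    }
    out += sgml_encode(head.substr(pos));                     // trailing text
    return out;
}
```

With this sketch, searching the raw excerpt for "réduction" (byte 0xE9
for the é) finds the word, whereas searching the output of
sgml_encode() on the whole excerpt does not, since the word is then
stored as r&#233;duction. Either way, a literal &#153; in the excerpt
still comes out as &amp;#153;, which is the point made above about
ampersands being encoded one way or the other.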