Re: [htdig-dev] Numbered HTML Entities mangled in Result Blurbs

Gilles Detillieux Wed, 19 Nov 2003 09:34:53 -0800

According to Neal Richter:
> >   This error is happening in the DISPLAY of the excerpts... so it
> > seems like looking for &#XXX; patterns and NOT encoding them before
> > display is a reasonable strategy... the browser will decide how to display it.


That would be a reasonable compromise, but note that it is a compromise.
For example, if an HTML document has something like "use &amp;#153; in
your HTML to encode a &#153; character", this will end up in db.excerpts
as "use &#153; in your HTML to encode a &#153; character".  At that point,
htsearch has no way of knowing that the first occurrence was originally
different from the second.  It comes down to a decision between encoding
both or leaving both as-is.
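
To make the compromise concrete, here is a minimal C++ sketch of a
display-time encoder that passes decimal &#NNN; patterns through untouched
and encodes everything else.  The function names are hypothetical, and this
is not the actual htsearch code, just an illustration of the idea.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: true if text[pos..] starts a decimal numeric entity
// such as "&#153;".  On success, len receives the length of the entity.
static bool numeric_entity_at(const std::string& text, std::size_t pos,
                              std::size_t& len) {
    if (text.compare(pos, 2, "&#") != 0) return false;
    std::size_t i = pos + 2;
    while (i < text.size() &&
           std::isdigit(static_cast<unsigned char>(text[i]))) {
        ++i;
    }
    // Require at least one digit and a terminating semicolon.
    if (i == pos + 2 || i >= text.size() || text[i] != ';') return false;
    len = i - pos + 1;
    return true;
}

// Encode &, < and > for display, but copy decimal numeric entities as-is.
std::string encode_keep_numeric(const std::string& text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ) {
        std::size_t len = 0;
        if (text[i] == '&' && numeric_entity_at(text, i, len)) {
            out.append(text, i, len);  // pass the entity through untouched
            i += len;
            continue;
        }
        switch (text[i]) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:  out += text[i];
        }
        ++i;
    }
    return out;
}
```

Given the db.excerpts text from the example above, both occurrences of
&#153; come out unchanged, so the one that was literal text in the original
document gets rendered by the browser as the character instead.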

The other option would be for htdig to replace the & lead-in character
of undecoded entities with some other, unambiguous lead-in character in
the database, so that htsearch could always distinguish between the two.
But what character could we use that wouldn't conflict with anything
else?

> Attn Gilles:
> 
> Display.cc
> Wed Mar 1 23:09:49 2000 UTC (3 years, 8 months ago) by grdetil
> http://cvs.sourceforge.net/viewcvs.py/htdig/htdig/htsearch/Display.cc?r1=1.100.2.14&r2=1.100.2.15&only_with_tag=htdig-3-2-x
> * htsearch/Display.cc (excerpt, hilight): move SGML encoding into
>   hilight() function, because when it's done earlier it breaks
>   highlighting of accented characters.
> 
> OK, this is causing the problem.... if I reverse the changes after line
> 1284, it will not improperly encode
> 
> &#153; --> &amp;#153;
> 
> If we want to highlight accented characters, it seems like that
> <strong>&#XXX;</strong> would do the trick.  We don't necessarily need to
> convert SGML entities to single chars for the display highlighting to
> work...

Well, the highlighting itself won't care, but before htsearch highlights
a word, it has to find it in the excerpt.  It does this using StringMatch.
If StringMatch is looking for words with accented letters, it's not going
to find them in the excerpt if they've already been SGML encoded there.

> Would you forward me an example of what this fix is supposed to do with
> an accented character?  I can redesign this chunk of code to accomplish
> both goals.

E.g. if you search for something like "réduction", then during the
StringMatch on the "head" string, it will find the unencoded word
réduction, but if we pre-encode the head string, it won't find the
encoded word r&eacute;duction.
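
A small C++ sketch of that failure follows.  It assumes Latin-1 text, uses a
plain substring search as a stand-in for StringMatch, and the encoder is a
hypothetical one that knows only a handful of characters.

```cpp
#include <string>

// Hypothetical encoder: maps markup characters and one Latin-1 byte
// (0xE9, e-acute) to SGML entities.  A real table would be much larger.
std::string sgml_encode(const std::string& text) {
    std::string out;
    for (unsigned char c : text) {
        switch (c) {
            case '&':  out += "&amp;";    break;
            case '<':  out += "&lt;";     break;
            case '>':  out += "&gt;";     break;
            case 0xE9: out += "&eacute;"; break;
            default:   out += static_cast<char>(c);
        }
    }
    return out;
}

// Stand-in for the StringMatch lookup: a plain substring search.
bool found_in(const std::string& haystack, const std::string& word) {
    return haystack.find(word) != std::string::npos;
}
```

Searching for the raw word in the raw head string succeeds, while searching
for it in sgml_encode(head) fails, because the accented byte has already
been replaced by &eacute; there.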

My fix was not to pre-encode the head string all at once into SGML
entities before highlighting, but rather encode it piece by piece during
the highlighting, so that the whole text does eventually get re-encoded.
Of course, it needs to be done piece by piece so that the HTML tags for
the highlighting, etc., don't get their < and > characters SGML encoded.
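
Roughly, the piece-by-piece approach looks like the C++ sketch below.  This
is a simplified illustration rather than the code in Display.cc: the names
are hypothetical, it matches a single word with a plain substring search
instead of StringMatch, and the encoder handles only a few characters.

```cpp
#include <string>

// Hypothetical encoder for one piece of text: markup characters plus one
// Latin-1 byte (0xE9, e-acute).
static std::string encode_piece(const std::string& piece) {
    std::string out;
    for (unsigned char c : piece) {
        switch (c) {
            case '&':  out += "&amp;";    break;
            case '<':  out += "&lt;";     break;
            case '>':  out += "&gt;";     break;
            case 0xE9: out += "&eacute;"; break;
            default:   out += static_cast<char>(c);
        }
    }
    return out;
}

// Search the raw (unencoded) head string, then encode each piece as it is
// emitted.  The <strong> tags are added after encoding, so their < and >
// characters are never encoded.
std::string hilight_sketch(const std::string& head, const std::string& word) {
    if (word.empty()) return encode_piece(head);
    std::string out;
    std::size_t pos = 0;
    for (;;) {
        const std::size_t hit = head.find(word, pos);
        if (hit == std::string::npos) break;
        out += encode_piece(head.substr(pos, hit - pos));  // text before match
        out += "<strong>" + encode_piece(word) + "</strong>";
        pos = hit + word.size();
    }
    out += encode_piece(head.substr(pos));  // remainder after the last match
    return out;
}
```

Note that an ampersand in the head string is still encoded by this sketch,
just as it would be if the whole string were encoded up front.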

I fail to see how my fix causes your problem, though.  Whether you
SGML encode the whole excerpt ("head" string) in one fell swoop before
highlighting, or bit by bit during, you're still going to SGML encode
all the ampersand characters one way or the other.  Unless what you're
suggesting is to take out the SGML encoding altogether -- that would be
a mistake because then unencoded < and > characters in the excerpt would
not get properly encoded and could cause all sorts of problems.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

