According to Jamie Anstice:
> Here's a quickie that someone else might like to verify if they've
> run into the same problem. When htdig encounters an entity that it
> doesn't know about (say &#146; - which should really be ’ but
> that's another issue) it copies it verbatim to the extract - so far
> so good. When the extract is sent out in Display::hilight, the
> extract is decoded with HtSGMLCodec to transform the unsigned char
> characters to entities, and as well as the characters above 160 it
> translates & to &amp;, which is fine except when & is the start of
> an entity. This is what leaves things like &amp;#146; in extracts.
> Here's a patch to HtSGMLCodec::decode to make sure that it doesn't break
> real entities.

The problem with this is that it doesn't distinguish between an "&" in the excerpt that came from a bare "&" and one that came from an "&amp;" in the source document. For example, if a document gives an example of SGML encoding, and therefore contains something like "&amp;lt;" in the source HTML document, it goes into the excerpt in the database as "&lt;". With your patch, that "&lt;" in the excerpt doesn't get expanded back to "&amp;lt;" in the resulting HTML output, so the browser shows "<" where the original page showed "&lt;".

I guess it comes down to which is the lesser of two evils, and probably your patch is the better choice.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
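
The trade-off discussed above can be illustrated with a small standalone sketch. This is not the patch to HtSGMLCodec::decode (which is not reproduced in this thread) and it does not use any ht://Dig classes; the function names `looksLikeEntity` and `escapeAmpersands` are hypothetical, and the rule for what counts as an entity ("&name;" or "&#digits;") is an assumption made for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of ht://Dig): returns true if the '&' at
// s[pos] starts something shaped like an entity, i.e. "&name;" or "&#digits;".
static bool looksLikeEntity(const std::string &s, std::size_t pos)
{
    std::size_t i = pos + 1;
    if (i < s.size() && s[i] == '#') {
        // Numeric form: at least one digit, then ';'.
        std::size_t start = ++i;
        while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i])))
            ++i;
        return i > start && i < s.size() && s[i] == ';';
    }
    // Named form: at least one alphanumeric character, then ';'.
    std::size_t start = i;
    while (i < s.size() && std::isalnum(static_cast<unsigned char>(s[i])))
        ++i;
    return i > start && i < s.size() && s[i] == ';';
}

// Hypothetical sketch of the patched behaviour: escape a bare '&' as "&amp;",
// but leave alone any '&' that already begins an entity-shaped sequence.
std::string escapeAmpersands(const std::string &in)
{
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '&' && !looksLikeEntity(in, i))
            out += "&amp;";
        else
            out += in[i];
    }
    return out;
}
```

Under these assumptions, `escapeAmpersands("it&#146;s")` leaves the unknown entity intact and `escapeAmpersands("a & b")` still yields `a &amp; b`. The cost is the case raised in the reply: an excerpt holding `&lt;` that originally came from `&amp;lt;` is also passed through unchanged, because at this stage the two origins can no longer be told apart.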

