According to [EMAIL PROTECTED]:
> I'm using htdig 3.2 on RedHat 7.1 and I have this problem. When I build the word
> database and search my web pages, the result page shows the wrong
> non-English characters. When I look at the HTML code, I
> find that these wrong characters were written as "& character" entities
> (&aacute; &iacute; ...). Where is the problem?

This is a known bug in htcommon/HtSGMLCodec.cc. It sets up the same rules for decoding and re-encoding SGML entities. The problem is that when you use an 8-bit encoding other than ISO-8859-1 (Latin 1, Western Europe), the accented characters in the upper half of the character set get re-encoded as the SGML entities for the Latin 1 set, so any character whose code means something different in Latin 1 is displayed incorrectly. The only fix right now is to hack HtSGMLCodec not to do this.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
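
To make the effect concrete, here is a small Python sketch of the behaviour described above. It is not the ht://Dig code (which is C++); the table and the function names `encode_as_latin1_entities` and `passthrough` are hypothetical, and the choice of ISO-8859-2 with a Czech word is only an assumed example of a non-Latin-1 8-bit encoding.

```python
import html
from html.entities import codepoint2name

# Hypothetical stand-in for the codec's re-encoding table: every byte in the
# upper half of Latin 1 (0xA0-0xFF) mapped to its named entity. In Latin 1 the
# byte value equals the Unicode code point, so codepoint2name can be indexed
# by the byte directly.
LATIN1_ENTITIES = {
    byte: "&{};".format(codepoint2name[byte]).encode("ascii")
    for byte in range(0xA0, 0x100)
    if byte in codepoint2name
}


def encode_as_latin1_entities(raw: bytes) -> bytes:
    """Replace each upper-half byte with its Latin 1 entity, ignoring the real charset."""
    return b"".join(LATIN1_ENTITIES.get(b, bytes([b])) for b in raw)


def passthrough(raw: bytes) -> bytes:
    """The kind of workaround suggested above: leave 8-bit bytes untouched."""
    return raw


word = "řeč"                      # Czech for "speech"
raw = word.encode("iso-8859-2")   # the bytes as stored in a Latin 2 page

broken = encode_as_latin1_entities(raw)
# A browser resolves the entities to Latin 1 characters, not the original ones.
shown = html.unescape(broken.decode("ascii"))

print(broken)                                  # entity-encoded bytes
print(shown)                                   # what the reader sees
print(passthrough(raw).decode("iso-8859-2"))   # original text survives
```

Characters that sit at the same position in both sets (á and í are at the same codes in ISO-8859-1 and ISO-8859-2) come out right by coincidence, which is why only some letters look wrong.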

