According to Alexander I. Lebedev:
> I've installed htdig3.20b3 on Linux box and tried to index the files
> in Russian. I'm using locale ru_RU.KOI8-R. The indexing went OK, but
> when I looked at the output in Netscape, I found the text in a different
> encoding like this:
>
> Search results for 'ÐÅÒÅÈÏ&Aunl;'
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> (it seems to be ISO8859-1 encoding or low byte of Unicode).
> The most interesting is when I downgraded to htdig3.1.5 and indexed the
> same files with the similar config file, I was able to see the Russian
> words in Netscape.
>
> Is there a way to solve the problem?

Well, this is a rather nasty bug, and it's going to take some changes
to the 3.2 htsearch code to fix it. 3.1.5 handles SGML entities
differently than 3.2 does. Specifically, 3.2's code is more "aggressive"
in translating decoded characters back into SGML entities. 3.1.5 only
translates '<', '>', '&', '"' and the non-breaking space back into their
SGML encodings, while 3.2.0b* seems to convert all characters from the
upper half of the character set to their SGML entities. Unfortunately, in
doing so it disregards the fact that the internal encodings are
locale-specific, and not guaranteed to be ISO-8859-1.
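
To make the failure mode concrete, here is a minimal sketch of what an
"aggressive" reverse translation does to KOI8-R text. This is not the
actual HtSGMLCodec code: the function name encodeAssumingLatin1 is
hypothetical, and it emits numeric character references rather than the
named entities htsearch produces, which is a simplification on my part.
The effect in the browser is the same either way.

```cpp
#include <string>

// Hypothetical illustration, not ht://Dig code: every byte in the upper
// half (>= 0xA0) is rewritten as a character reference, on the assumption
// that the byte value is an ISO-8859-1 code point. That assumption is
// wrong for locales such as ru_RU.KOI8-R.
std::string encodeAssumingLatin1(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c >= 0xA0) {
            out += "&#" + std::to_string(static_cast<int>(c)) + ";";
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

Feeding it the KOI8-R bytes 0xD0 0xC5 0xD2 (the Cyrillic letters "пер")
yields "&#208;&#197;&#210;", which a browser renders as the Latin-1
letters "ÐÅÒ" no matter what charset the page declares. That matches the
start of the garbled search string in the report.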

I suppose the easy fix would be to take the encodeSGML() function from
3.1.5's htsearch/Display.cc, and use it in 3.2 instead of the HtSGMLCodec
class which it uses now. The other fix would be to change HtSGMLCodec
to have a more limited set of reverse translations. I think the second
approach is preferable, but I'm not quite sure what the best way to
do this is.
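
As a rough sketch of what a "more limited set of reverse translations"
could look like, the function below escapes only the four characters
that are significant to the markup itself and passes every other byte
through unchanged. The name encodeMarkupOnly is hypothetical; this is
neither 3.1.5's encodeSGML() nor a patch to HtSGMLCodec. It also leaves
byte 160 alone, which is a deliberate difference from the 3.1.5
behaviour described above.

```cpp
#include <string>

// Hypothetical sketch, not ht://Dig code: escape only the characters
// that would otherwise be parsed as markup. Bytes in the upper half are
// copied as-is, so the page's own charset decides how they are shown.
std::string encodeMarkupOnly(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
            case '<': out += "&lt;";   break;
            case '>': out += "&gt;";   break;
            case '&': out += "&amp;";  break;
            case '"': out += "&quot;"; break;
            default:  out += c;        break;
        }
    }
    return out;
}
```

With this approach the KOI8-R bytes from the report would reach the
browser untouched, and only strings such as "a<b" would change (to
"a&lt;b").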

At this point, I'm not even sure that 3.1.5's assumption, that character
160 always represents a non-breaking space (&nbsp;), is correct for all
locales' character sets. Can anyone confirm this?

--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930