On 2/17/19, Eli Zaretskii <[email protected]> wrote:
> I've found a problem with Texinfo 6.6 which happens only when using
> TEXINFO_XS_PARSER=1: any UTF-8 encoded text in the Texinfo sources is
> corrupted in the output. For example, the UTF-8 sequence \303\240,
> which is the encoding of à, becomes \303\203\302\240, and
> \342\200\230, which is the encoding of ‘ (left single curved quote)
> becomes \303\242\302\200\302\230. Try generating elisp.info from the
> latest master branch of the Emacs Git repository to see this.
>
> If you don't see the problem on your system, try doing the above in a
> non-UTF-8 locale (e.g., a Latin-1 locale). If that doesn't succeed in
> reproducing the problem, either, it could be Windows specific, but in
> that case I will need your guidance for where to look.

It must be a problem that only occurs in certain circumstances. I
couldn't reproduce it with the attached file, whether in a UTF-8
locale or a Latin-1 locale.

Are you sure that @documentencoding UTF-8 is present in the file? I
didn't clone the Emacs repository, but I couldn't see it at
http://git.savannah.gnu.org/cgit/emacs.git/tree/doc/lispref/elisp.texi
(unless it is from some included file).

You could add debugging statements to the code to check what encoding
the input is being interpreted as. For example,

diff --git a/tp/Texinfo/XS/parsetexi/input.c b/tp/Texinfo/XS/parsetexi/input.c
index cad1edc..663e6c1 100644
--- a/tp/Texinfo/XS/parsetexi/input.c
+++ b/tp/Texinfo/XS/parsetexi/input.c
@@ -196,9 +196,11 @@ convert_to_utf8 (char *s, char *input_encoding)
   switch (enc)
     {
     case ce_utf8:
+      fprintf(stderr, "converting <%s> from utf-8\n", s);
       return s; /* no conversion required. */
       break;
     case ce_latin1:
+      fprintf(stderr, "converting <%s> from latin 1\n", s);
       our_iconv = iconv_from_latin1;
       break;
     case ce_latin2:

With the input file I attached, I get:

TEXINFO_XS_PARSER=1 ./texi2any.pl test.texi
converting <\input texinfo > from latin 1
converting < > from latin 1
converting <@documentencoding UTF-8 > from latin 1
converting < > from utf-8
converting <@node Top > from utf-8
converting <@top > from utf-8
converting < > from utf-8
converting <à > from utf-8
converting < > from utf-8
converting <@bye > from utf-8

Of course, it is possible that something goes wrong somewhere else.

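For what it's worth, the corrupted sequences quoted in the report are exactly what you get if bytes that are already UTF-8 are treated as Latin-1 and converted to UTF-8 a second time. That is why it seems worth checking which case of the switch is taken. Below is a minimal sketch to check the arithmetic. The functions latin1_to_utf8 and print_octal are hypothetical helpers written for this illustration only; they are not part of Texinfo and do not use iconv the way input.c does.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not taken from Texinfo: treat each byte of IN as
   a Latin-1 character and write its UTF-8 encoding to OUT.  OUT must
   have room for 2 * strlen (in) + 1 bytes.  Returns the number of bytes
   written, not counting the terminating NUL. */
size_t
latin1_to_utf8 (const char *in, char *out)
{
  const unsigned char *p;
  size_t n = 0;

  for (p = (const unsigned char *) in; *p; p++)
    {
      if (*p < 0x80)
        out[n++] = (char) *p;                     /* ASCII: unchanged */
      else
        {
          /* U+0080..U+00FF become two bytes: 110000xx 10xxxxxx */
          out[n++] = (char) (0xC0 | (*p >> 6));
          out[n++] = (char) (0x80 | (*p & 0x3F));
        }
    }
  out[n] = '\0';
  return n;
}

/* Print S with non-ASCII bytes as octal escapes, in the same notation
   as used in the report. */
void
print_octal (const char *s)
{
  const unsigned char *p;

  for (p = (const unsigned char *) s; *p; p++)
    {
      if (*p < 0x80)
        putchar (*p);
      else
        printf ("\\%03o", (unsigned int) *p);
    }
  putchar ('\n');
}
```

Calling latin1_to_utf8 ("\303\240", buf) and then print_octal (buf) prints \303\203\302\240, and the same with "\342\200\230" prints \303\242\302\200\302\230. Both match the corrupted output in the report. This only shows that the symptom is consistent with a Latin-1 to UTF-8 conversion being applied to UTF-8 input; it does not say where in the program that happens.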
test.info
Description: Binary data
test.texi
Description: TeXInfo document
