Re: UTF-8 conversion problem in Texinfo 6.6 with TEXINFO_XS_PARSER

Gavin Smith Sun, 17 Feb 2019 16:34:20 -0800

On 2/17/19, Eli Zaretskii wrote:
> I've found a problem with Texinfo 6.6 which happens only when using
> TEXINFO_XS_PARSER=1: any UTF-8 encoded text in the Texinfo sources is
> corrupted in the output.  For example, the UTF-8 sequence \303\240,
> which is the encoding of à, becomes \303\203\302\240, and
> \342\200\230, which is the encoding of ‘ (left single curved quote)
> becomes \303\242\302\200\302\230.  Try generating elisp.info from the
> latest master branch of the Emacs Git repository to see this.
>
> If you don't see the problem on your system, try doing the above in a
> non-UTF-8 locale (e.g., a Latin-1 locale).  If that doesn't succeed in
> reproducing the problem, either, it could be Windows specific, but in
> that case I will need your guidance for where to look.


It must be a problem that only occurs in certain circumstances. I
couldn't reproduce it with the attached file, whether in a UTF-8
locale or a Latin-1 locale.

Are you sure that @documentencoding UTF-8 is present in the file? I
didn't clone the Emacs repository, but I couldn't find the directive in
http://git.savannah.gnu.org/cgit/emacs.git/tree/doc/lispref/elisp.texi
(unless it comes from an included file).
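
The reason for asking: the byte sequences you report are exactly what
results when UTF-8 input is treated as Latin-1 and converted to UTF-8 a
second time. A minimal sketch of that conversion follows. It is not code
from Texinfo (which goes through iconv); latin1_to_utf8 is a
hypothetical helper written only to show the arithmetic.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of Texinfo: treat every byte of S as a
   Latin-1 character and encode it as UTF-8.  Returns a malloc'd string
   that the caller must free, or NULL if allocation fails. */
static char *
latin1_to_utf8 (const char *s)
{
  size_t len = strlen (s);
  /* Each Latin-1 byte becomes at most two UTF-8 bytes. */
  unsigned char *out = malloc (2 * len + 1);
  unsigned char *p = out;
  const unsigned char *q = (const unsigned char *) s;

  if (!out)
    return NULL;

  for (; *q; q++)
    {
      if (*q < 0x80)
        *p++ = *q;                      /* ASCII passes through unchanged. */
      else
        {
          *p++ = 0xC0 | (*q >> 6);      /* lead byte, 0xC2 or 0xC3 */
          *p++ = 0x80 | (*q & 0x3F);    /* continuation byte */
        }
    }
  *p = '\0';
  return (char *) out;
}
```

Feeding it "\303\240" gives "\303\203\302\240", and "\342\200\230"
gives "\303\242\302\200\302\230", which matches the corruption you
describe. That fits the input being taken as Latin-1, although it does
not show where in the processing the extra conversion happens.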

You could add debugging statements to the code to check what encoding
the input is being interpreted as. For example,

diff --git a/tp/Texinfo/XS/parsetexi/input.c b/tp/Texinfo/XS/parsetexi/input.c
index cad1edc..663e6c1 100644
--- a/tp/Texinfo/XS/parsetexi/input.c
+++ b/tp/Texinfo/XS/parsetexi/input.c
@@ -196,9 +196,11 @@ convert_to_utf8 (char *s, char *input_encoding)
   switch (enc)
     {
     case ce_utf8:
+      fprintf(stderr, "converting <%s> from utf-8\n", s);
       return s; /* no conversion required. */
       break;
     case ce_latin1:
+      fprintf(stderr, "converting <%s> from latin 1\n", s);
       our_iconv = iconv_from_latin1;
       break;
     case ce_latin2:


With the input file I attached, I get:

 TEXINFO_XS_PARSER=1 ./texi2any.pl test.texi

converting <\input texinfo
> from latin 1
converting <
> from latin 1
converting <@documentencoding UTF-8
> from latin 1
converting <
> from utf-8
converting <@node Top
> from utf-8
converting <@top
> from utf-8
converting <
> from utf-8
converting <à
> from utf-8
converting <
> from utf-8
converting <@bye
> from utf-8


Of course, it is possible that something goes wrong somewhere else.
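
When checking other places, note that printing strings with %s leaves it
to the terminal to interpret the bytes, which can hide the difference
between a correctly encoded and a doubly encoded character. A small
sketch of a helper that shows non-ASCII bytes as octal escapes instead
(escape_octal is a hypothetical name, not an existing Texinfo function):

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: copy S into OUT (capacity OUTSZ), writing every
   byte outside printable ASCII as a backslash plus three octal digits.
   The result is truncated if OUT is too small, and is NUL-terminated
   whenever OUTSZ is greater than zero. */
static void
escape_octal (const char *s, char *out, size_t outsz)
{
  size_t used = 0;
  const unsigned char *q = (const unsigned char *) s;

  if (outsz == 0)
    return;
  out[0] = '\0';

  for (; *q; q++)
    {
      int printable = (*q >= 0x20 && *q < 0x7F);
      size_t need = printable ? 1 : 4;

      /* Stop before overflowing; keep one byte for the terminator. */
      if (used + need + 1 > outsz)
        break;

      if (printable)
        out[used++] = (char) *q;
      else
        used += (size_t) snprintf (out + used, outsz - used,
                                   "\\%03o", (unsigned) *q);
      out[used] = '\0';
    }
}
```

A debugging statement could then pass the escaped buffer to fprintf
rather than the raw string, so that the output reads \303\240 or
\303\203\302\240 regardless of the terminal's locale.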

test.info
Description: Binary data

test.texi
Description: TeXInfo document
