> From: Gavin Smith <[email protected]>
> Date: Mon, 18 Feb 2019 00:33:08 +0000
> Cc: [email protected]
>
> > If you don't see the problem on your system, try doing the above in a
> > non-UTF-8 locale (e.g., a Latin-1 locale).  If that doesn't succeed in
> > reproducing the problem, either, it could be Windows specific, but in
> > that case I will need your guidance for where to look.
>
> It must be a problem that only occurs in certain circumstances.
Indeed, it is triggered by @include'ing a file with @documentencoding.

> I couldn't reproduce it with the attached file, whether in a UTF-8
> locale or a Latin-1 locale.

Right, because your @documentencoding is in the file where it is
needed.

> Are you sure that @documentencoding UTF-8 is present in the file?  I
> didn't clone the Emacs repository, but I couldn't see it at
> http://git.savannah.gnu.org/cgit/emacs.git/tree/doc/lispref/elisp.texi
> (unless it is from some included file).

In the Emacs manuals, @documentencoding is in the included file
docstyle.texi, which elisp.texi includes.

> You could add debugging statements to the code to check what encoding
> the input is being interpreted as.  For example,

Thanks, I think I see the problem.  It's because the code keeps
input_encoding on the input_stack.  This means each included file
starts with an input_encoding of zero (which happens to stand for
Latin-1), and when reading of the include file is exhausted, the code
pops the input_stack, so any @documentencoding set by an include file
is thrown away, and any file included after @documentencoding has its
encoding reset to Latin-1.

But @documentencoding is a global setting: once set, it should remain
in effect for everything read thereafter, until it is changed by
another @documentencoding or until EOF.  I think this means
input_encoding should be part of global_info, not of input_stack.

Btw, I think there's a more general issue here.  It sounds like, in
the absence of any @documentencoding directive, the C parser assumes
Latin-1, something that doesn't seem to be documented in the Texinfo
manual, and perhaps isn't even the best default nowadays.  It means,
for example, that a document with UTF-8 encoded non-ASCII characters
but without @documentencoding will have its non-ASCII characters
"converted" on output.  Is that the intended behavior, and is it
consistent with what the Perl parser does?
If so, I think it should be prominently documented, and we should
perhaps consider changing the default to UTF-8.

Thanks.
