> From: Gavin Smith <[email protected]>
> Date: Mon, 18 Feb 2019 00:33:08 +0000
> Cc: [email protected]
>
> > If you don't see the problem on your system, try doing the above in a
> > non-UTF-8 locale (e.g., a Latin-1 locale).  If that doesn't succeed in
> > reproducing the problem, either, it could be Windows specific, but in
> > that case I will need your guidance for where to look.
>
> It must be a problem that only occurs in certain circumstances.
Indeed, it is triggered by @include'ing a file with @documentencoding.

> I couldn't reproduce it with the attached file, whether in a UTF-8
> locale or a Latin-1 locale.

Right, because your @documentencoding is in the file where it is
needed.

> Are you sure that @documentencoding UTF-8 is present in the file?  I
> didn't clone the Emacs repository, but I couldn't see it at
> http://git.savannah.gnu.org/cgit/emacs.git/tree/doc/lispref/elisp.texi
> (unless it is from some included file).

In the Emacs manuals, @documentencoding is in the included file
docstyle.texi, which elisp.texi includes.

> You could add debugging statements to the code to check what encoding
> the input is being interpreted as.  For example,

Thanks, I think I see the problem.  It's because the code keeps
input_encoding on the input_stack.  This means each included file
starts with an input_encoding of zero (which happens to stand for
Latin-1), and when reading of the include file is exhausted, the code
pops the input_stack, so any @documentencoding set by an include file
is thrown away, and any file included after @documentencoding has its
encoding reset to Latin-1.

But @documentencoding is a global setting: once set, it should remain
in effect for everything read thereafter, until it is changed by
another @documentencoding or until EOF.  I think this means
input_encoding should be part of global_info, not of input_stack.

Btw, I think there's a more general issue here.  It sounds like, in
the absence of any @documentencoding directive, the C parser assumes
Latin-1, something that doesn't seem to be documented in the Texinfo
manual, and perhaps isn't even the best default nowadays.  It means,
for example, that a document with UTF-8 encoded non-ASCII characters
but without @documentencoding will have its non-ASCII characters
"converted" on output.  Is that the intended behavior, and is it
consistent with what the Perl parser does?
If so, I think it should be prominently documented, and we should
perhaps consider changing the default to UTF-8.

Thanks.
