> From: Gavin Smith <[email protected]>
> Date: Fri, 16 Jan 2026 19:18:23 +0000
> Cc: Eli Zaretskii <[email protected]>, [email protected], [email protected]
>
> On Thu, Jan 15, 2026 at 10:33:58PM +0100, Bruno Haible via Bug reports for
> the GNU Texinfo documentation system wrote:
> > Eli Zaretskii wrote:
> > > Regarding the encoding of the Info file, it is not a serious problem,
> > > because (a) most Info files use UTF-8 anyway, and (b) the Info reader
> > > already includes support for re-encoding other codesets to UTF-8
> (provided that the Info reader is built with libiconv). So the only
> > > case where the encoding of the Info file is relevant is if the Info
> > > reader was built without libiconv.
> >
> > I see. So the problem is reduced to displaying
> > - (U) UTF-8 text in memory (most frequent case), or
> > - (L) locale-encoded text in memory (only if no iconv API available).
>
> As I understand, it is case (L) that is handled in info.
>
> info attempts to recode Info files to the locale encoding (using the
> iconv function in libc). It then uses locale-aware functions to process
> the contents of Info files. It does not make explicit use of UTF-8 in
> many places.
Yes.
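Schematically, the current scheme amounts to a single conversion on
input (an illustrative sketch with simplified names and no error
handling, not the actual scan.c code):

    #include <iconv.h>
    #include <langinfo.h>

    /* Current scheme, schematically: recode the Info file from its
       declared encoding into the locale's codeset once, on input.
       All later processing then uses locale-aware libc functions.  */
    iconv_t
    open_input_conversion (const char *file_encoding)
    {
      return iconv_open (nl_langinfo (CODESET), file_encoding);
    }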
> The proposal to use "libiconv" appears to assume that the target encoding
> is always "UTF-8", which would require a slight change in how info loads
> Info files: it would recode Info files to UTF-8 always, rather than
> to the locale encoding. It wasn't clear to me from this discussion whether
> people understood that info already uses the iconv function.
The proposal is to convert the file's text to UTF-8, process it in
UTF-8, then convert to the target encoding when outputting the text to
the screen.
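Schematically, that replaces the single conversion with two (again an
illustrative sketch, not existing code):

    #include <iconv.h>
    #include <langinfo.h>

    /* Proposed scheme: recode to UTF-8 on input, process everything
       in UTF-8 internally, and recode to the terminal's encoding
       only when writing to the screen.  Names are illustrative.  */
    iconv_t to_utf8;     /* used once, when the file is loaded    */
    iconv_t to_output;   /* used when text is sent to the screen  */

    void
    open_conversions (const char *file_encoding)
    {
      to_utf8   = iconv_open ("UTF-8", file_encoding);
      to_output = iconv_open (nl_langinfo (CODESET), "UTF-8");
    }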
> It would then require rewriting the whole program to use libunistring
> instead of libc functions.
Yes.
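To give an idea of what that entails, here's a sketch of what
UTF-8-aware replacements for wrappers such as mb_len and mb_isprint
(discussed below) might look like; the libunistring-based bodies are
hypothetical, not code that exists today:

    #include <stddef.h>
    #include <stdint.h>
    #include <unistr.h>      /* u8_mblen, u8_mbtouc (libunistring) */
    #include <unictype.h>    /* uc_is_print (libunistring) */

    /* Hypothetical UTF-8-aware counterparts of mb_len/mb_isprint,
       bypassing the C runtime's locale-dependent functions.  */
    int
    mb_len_utf8 (const char *s, size_t n)
    {
      return u8_mblen ((const uint8_t *) s, n);
    }

    int
    mb_isprint_utf8 (const char *s, size_t n)
    {
      ucs4_t uc;
      /* u8_mbtouc maps invalid sequences to U+FFFD instead of
         failing, so no separate error check is needed here.  */
      u8_mbtouc (&uc, (const uint8_t *) s, n);
      return uc_is_print (uc);
    }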
> Eli: what is missing from my understanding of your use case is what
> is going on in scan.c:copy_converting, when the Info file is first
> read in. Does conversion of input files to UTF-8, based on the locale,
> actually happen?
In the cases I tried, that conversion was not needed: the Info file
was already in UTF-8. (In fact, I have yet to see an Info file
encoded in some Windows codepage -- it just doesn't happen in my
experience, definitely not
with GNU projects. Either the files are pure ASCII or they are in
UTF-8.)
The current code of the Info reader, after the patches that I
submitted recently, reports "UTF-8" from nl_langinfo when the
terminal's encoding is UTF-8. And since the file is in UTF-8, the
iconv_to_output conversion is a no-op. But the character
classification and iteration in printed_representation are still done,
and they use the locale's encoding, because they eventually call
locale-aware functions from the C runtime.
> Can I clarify that "shown as raw bytes" means that they look like
> "\302\251", i.e. as backslash escape sequences?
Actually, even worse: some look like control characters, some (e.g.,
\200) look like ASCII strings produced to represent non-printable
characters, i.e., with an actual ASCII backslash and 3 octal digits.
That's because printed_representation uses the locale-aware functions
from the C runtime, and the locale hasn't been changed to use UTF-8
(and with the older Windows runtime MSVCRT it cannot be changed in
principle, because MSVCRT didn't support UTF-8).
What I wanted to accomplish was simple: have Info interpret the text
as UTF-8, and output it as UTF-8. But because the C runtime functions
like mbrlen and iswprint, which are called by mb_len and mb_isprint,
don't recognize UTF-8, they return results which get in the way.
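Here's a minimal demonstration of the difference (assumes
libunistring is installed; compile with -lunistring):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>       /* mbrlen */
    #include <unistr.h>      /* u8_mbtouc (libunistring) */
    #include <unictype.h>    /* uc_is_print (libunistring) */

    int
    main (void)
    {
      setlocale (LC_ALL, "");

      /* U+00A9 COPYRIGHT SIGN in UTF-8 -- the "\302\251" above.  */
      const char *s = "\302\251";
      mbstate_t st;
      memset (&st, 0, sizeof st);

      /* Locale-dependent: under a single-byte codeset this misparses
         the sequence as two separate characters, or fails.  */
      printf ("mbrlen: %ld\n", (long) mbrlen (s, strlen (s), &st));

      /* Locale-independent: libunistring always parses UTF-8.  */
      ucs4_t uc;
      int len = u8_mbtouc (&uc, (const uint8_t *) s, strlen (s));
      printf ("u8_mbtouc: %d bytes, U+%04X, printable: %d\n",
              len, (unsigned) uc, (int) uc_is_print (uc));
      return 0;
    }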
> If the iteration over codepoints in printed_representation does not
> work, not recognising non-ASCII UTF-8 sequences even though the terminal
> supports them, then it would be better to fall back to ASCII substitutes
> when the file is first read in. This would not be the best but would
be better than getting the "\302\251" everywhere. This would mean using
> the degrade_utf8 function in scan.c. Another possibility is using
> the //TRANSLIT flag for an encoding passed to iconv (I didn't know about
> this possibility when I wrote the ASCII degradation code, as it wasn't
> documented in the libc manual or anywhere else I looked.)
This already works, and worked in previous versions of Texinfo. If
the terminal's encoding is anything other than UTF-8, the Info reader
degrades the non-ASCII characters to their ASCII equivalents, whether
via //TRANSLIT or degrade_utf8. What I wanted was to allow Info to
output the original UTF-8-encoded characters to the Windows terminal
when the terminal's encoding is UTF-8, even if the locale's codeset is
different. And that will not work unless Info learns to use
UTF-8-aware functions from libunistring to handle multibyte characters
_instead_ of the C runtime functions.
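As an aside, the //TRANSLIT degradation mentioned above can be
exercised in isolation like this (note that //TRANSLIT is a GNU
extension supported by glibc and GNU libiconv; other iconv
implementations may reject it):

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int
    main (void)
    {
      iconv_t cd = iconv_open ("ASCII//TRANSLIT", "UTF-8");
      if (cd == (iconv_t) -1)
        { perror ("iconv_open"); return 1; }

      char in[] = "\302\251";            /* U+00A9 COPYRIGHT SIGN */
      char out[16];
      char *inp = in, *outp = out;
      size_t inleft = strlen (in), outleft = sizeof out - 1;
      if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        perror ("iconv");
      *outp = '\0';
      printf ("%s\n", out);              /* e.g. "(C)" with glibc */
      iconv_close (cd);
      return 0;
    }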
So, to summarize:
. there's no regression in the Info reader wrt Texinfo 7.2
. I would like to improve Info to support UTF-8 when possible, but
that requires changes in how Info handles non-ASCII Info files
I hope this clarifies my position.