Re: multibyte characters in the Info reader

Eli Zaretskii Fri, 16 Jan 2026 11:35:51 -0800

> From: Bruno Haible <[email protected]>
> Cc: [email protected], [email protected]
> Date: Fri, 16 Jan 2026 14:41:00 +0100
> 
> > as the
> > Windows runtime doesn't support well (or not at all) characters beyond
> > the BMP, replacing its standard C functions in Gnulib with versions
> > that accept char32_t codepoints
> 
> This is an effort that is already completed in Gnulib: Since we can't
> redefine the 'wchar_t' type, we had to create another set of functions
> that work on char32_t[] strings. [2][3] A couple of GNU packages already
> make use of it, so as to support characters outside the BMP correctly
> on Cygwin and native Windows.


Then maybe Texinfo could use that as well.

> > and paying attention to the console's
> > output codepage rather than the system locale's codeset
> 
> This is a request of the past. For a couple of years now, the output
> functions in the Microsoft runtime library have automatic conversion
> from the locale encoding to the console's output codepage (e.g.
> from CP1252 to CP850). [4] There is no need any more to care about
> this difference in Gnulib or in GNU packages, except for the workarounds
> mentioned in [4].

I think we are miscommunicating.  I didn't mean to allude to what the
Windows runtime does, I meant to allude to what GNU packages do when
they run on Windows.  While on Posix platforms the terminal's encoding
is (AFAIK) determined only by the locale's codeset, on Windows users
can change the encoding of the terminal without changing the
system-wide locale.  However, many GNU packages, being of Posix
origin, only look at nl_langinfo(CODESET) when they decide whether
they should use UTF-8.  What I suggest is that console programs which
output text to the terminal pay attention to the terminal's encoding
(via calling GetConsoleOutputCP) in preference to GetACP, and if the
former returns codepage 65001, use UTF-8 internally and for writing to
the console, even though the locale's codeset might not be UTF-8.
This requires use of functions that convert between multibyte and
wide-character representation of text which don't depend on the
Windows runtime, because the Windows runtime won't support UTF-8 if
the system locale is not set to something.UTF-8.
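
To make the suggestion concrete, here is a minimal sketch of the
decision, not code taken from any of the packages mentioned here.  The
names prefer_utf8 and terminal_uses_utf8 are hypothetical.  Falling
back to GetACP when GetConsoleOutputCP returns 0 (its documented
failure value, e.g. when no console is attached) is my assumption
about what a sensible fallback would be.  The Posix branch assumes the
caller has already called setlocale (LC_ALL, "").

```c
#include <string.h>
#ifdef _WIN32
# include <windows.h>
#else
# include <langinfo.h>
#endif

#define CODEPAGE_UTF8 65001u

/* Hypothetical helper: decide whether to use UTF-8 given the console's
   output codepage and the system ANSI codepage.  The console's
   codepage wins whenever it is known.  */
int
prefer_utf8 (unsigned console_cp, unsigned ansi_cp)
{
  if (console_cp != 0)
    return console_cp == CODEPAGE_UTF8;
  /* Assumption: no usable console codepage, fall back to the ANSI one.  */
  return ansi_cp == CODEPAGE_UTF8;
}

/* Hypothetical wrapper that queries the platform.  */
int
terminal_uses_utf8 (void)
{
#ifdef _WIN32
  return prefer_utf8 (GetConsoleOutputCP (), GetACP ());
#else
  /* On Posix the locale's codeset is the only source of information.
     Some systems spell the name differently (e.g. "utf8"); this
     sketch checks only the most common spelling.  */
  return strcmp (nl_langinfo (CODESET), "UTF-8") == 0;
#endif
}
```

With this, a console set to codepage 65001 selects UTF-8 even when the
system codepage is, say, 1252, which is exactly the case that looking
only at the locale's codeset misses.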

For example, there are programs which refrain from supporting output
of emoji when GetACP returns something other than 65001.  But the
Windows terminal on modern versions of Windows is entirely capable of
displaying emoji if the console's codepage is 65001.  So I'm about to
install changes in GDB that remove this unnecessary limitation when
the console's encoding is UTF-8.

As another example, the next release of GNU Awk will improve support
for Unicode on MS-Windows by using UTF-8 and char32_t internally when
the console's encoding is UTF-8.

I'm saying that other text-mode programs can and should move in this
direction.  But to be able to do so, they need to bypass the Windows
runtime for conversions between multibyte and char32_t representations
and for stuff like collation and character classification, and use
either the Gnulib functions or their own equivalents.
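
As an illustration of the "own equivalents" option, the following is a
minimal sketch of a UTF-8 to char32_t decoder that doesn't call into
the C runtime's multibyte functions at all.  The name utf8_decode_one
is hypothetical; this is not the Gnulib API and not what GDB or Gawk
actually do, just one way such a conversion could look.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Decode one UTF-8 sequence from s[0..n-1] into *out.
   Return the number of bytes consumed (1 to 4), or 0 if the input is
   empty, truncated, or not well-formed UTF-8.  */
size_t
utf8_decode_one (const unsigned char *s, size_t n, char32_t *out)
{
  assert (s != NULL && out != NULL);
  if (n == 0)
    return 0;

  unsigned char b0 = s[0];
  size_t len;
  char32_t cp, min;

  if (b0 < 0x80)
    {
      *out = b0;                /* plain ASCII */
      return 1;
    }
  else if ((b0 & 0xE0) == 0xC0)
    { len = 2; cp = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0)
    { len = 3; cp = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0)
    { len = 4; cp = b0 & 0x07; min = 0x10000; }
  else
    return 0;                   /* stray continuation or invalid lead byte */

  if (n < len)
    return 0;                   /* truncated sequence */

  for (size_t i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;               /* not a continuation byte */
      cp = (cp << 6) | (s[i] & 0x3F);
    }

  /* Reject overlong forms, surrogates, and values beyond U+10FFFF.  */
  if (cp < min || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
    return 0;

  *out = cp;
  return len;
}
```

Because the result is a char32_t, characters outside the BMP (emoji,
for instance) come out as a single codepoint regardless of how wide
wchar_t is on the platform.  Collation and character classification
would need similar runtime-independent replacements, which this sketch
doesn't attempt.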
