> From: Bruno Haible <[email protected]>
> Cc: [email protected], [email protected]
> Date: Fri, 16 Jan 2026 14:41:00 +0100
>
> > as the
> > Windows runtime doesn't support well (or not at all) characters beyond
> > the BMP, replacing its standard C functions in Gnulib with versions
> > that accept char32_t codepoints
>
> This is an effort that is already completed in Gnulib: Since we can't
> redefine the 'wchar_t' type, we had to create another set of functions
> that work on char32_t[] strings. [2][3] A couple of GNU packages already
> make use of it, so as to support characters outside the BMP correctly
> on Cygwin and native Windows.

Then maybe Texinfo could use that as well.

> > and paying attention to the console's
> > output codepage rather than the system locale's codeset
>
> This is a request of the past. For a couple of years now, the output
> functions in the Microsoft runtime library have automatic conversion
> from the locale encoding to the console's output codepage (e.g.
> from CP1252 to CP850). [4] There is no need any more to care about
> this difference in Gnulib or in GNU packages, except for the workarounds
> mentioned in [4].

I think we are miscommunicating.  I didn't mean to allude to what the
Windows runtime does, I meant to allude to what GNU packages do when
they run on Windows.  While on Posix platforms the terminal's encoding
is (AFAIK) determined only by the locale's codeset, on Windows users
can change the encoding of the terminal without changing the
system-wide locale.  However, many GNU packages, being of Posix
origin, only look at nl_langinfo(CODESET) when they decide whether
they should use UTF-8.

What I suggest is that console programs which output text to the
terminal pay attention to the terminal's encoding (by calling
GetConsoleOutputCP) in preference to GetACP, and if the former returns
codepage 65001, use UTF-8 internally and for writing to the console,
even though the locale's codeset might not be UTF-8.  This requires
the use of functions that convert between multibyte and wide-character
representations of text and that don't depend on the Windows runtime,
because the Windows runtime won't support UTF-8 if the system locale
is not set to something.UTF-8.

For example, there are programs which refrain from supporting output
of emoji when GetACP returns something other than 65001.  But the
Windows terminal on modern versions of Windows is entirely capable of
displaying emoji if the console's codepage is 65001.  So I'm about to
install changes in GDB that remove this unnecessary limitation when
the console's encoding is UTF-8.

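
To make the suggestion concrete, here is a minimal sketch of the check
I have in mind.  The helper names prefer_utf8 and output_is_utf8 are
made up for illustration and are not taken from GDB, Gawk, or Gnulib.
The sketch assumes that GetConsoleOutputCP returns 0 when no console
is attached, in which case it falls back to GetACP.

```c
#include <stdbool.h>
#include <string.h>

#ifdef _WIN32
# include <windows.h>
#else
# include <langinfo.h>
#endif

/* Windows codepage number for UTF-8.  */
#define UTF8_CODEPAGE 65001u

/* Hypothetical helper: the policy decision, kept separate from the
   system calls.  The console's output codepage takes precedence; the
   ANSI codepage is consulted only when CONSOLE_CP is 0 (assumed to
   mean "no console").  */
static bool
prefer_utf8 (unsigned int console_cp, unsigned int ansi_cp)
{
  if (console_cp != 0)
    return console_cp == UTF8_CODEPAGE;
  return ansi_cp == UTF8_CODEPAGE;
}

/* Hypothetical helper: should the program use UTF-8 for output?  */
static bool
output_is_utf8 (void)
{
#ifdef _WIN32
  return prefer_utf8 (GetConsoleOutputCP (), GetACP ());
#else
  /* On Posix the locale's codeset decides; this assumes the caller
     has already called setlocale.  */
  return strcmp (nl_langinfo (CODESET), "UTF-8") == 0;
#endif
}
```

A program would call output_is_utf8 once at startup and, if it
returns true, keep its text in UTF-8 regardless of what the locale's
codeset says.
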
As another example, the next release of GNU Awk will improve support for Unicode on MS-Windows by using UTF-8 and char32_t internally when the console's encoding is UTF-8. I'm saying that other text-mode programs can and should move in this direction. But to be able to do so, they need to bypass the Windows runtime for conversions between multibyte and char32_t representations and for stuff like collation and character classification, and use either the Gnulib functions or their own equivalents.
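
As an illustration of what "their own equivalents" could look like,
here is a sketch of a UTF-8 to char32_t decoder that doesn't consult
the C runtime's locale at all.  The function name utf8_decode is
hypothetical; programs that use Gnulib would presumably use its
char32_t functions instead of something hand-rolled like this.

```c
#include <stddef.h>
#include <uchar.h>

/* Decode one UTF-8 sequence from the N bytes at S.  On success store
   the code point in *OUT and return the number of bytes consumed.
   Return 0 if the input is empty, truncated, or not valid UTF-8
   (overlong forms, surrogates, and values above U+10FFFF are
   rejected).  */
static size_t
utf8_decode (const unsigned char *s, size_t n, char32_t *out)
{
  size_t len, i;
  char32_t cp, min;

  if (n == 0)
    return 0;

  if (s[0] < 0x80)
    {
      *out = s[0];
      return 1;
    }
  else if ((s[0] & 0xE0) == 0xC0)
    { len = 2; cp = s[0] & 0x1F; min = 0x80; }
  else if ((s[0] & 0xF0) == 0xE0)
    { len = 3; cp = s[0] & 0x0F; min = 0x800; }
  else if ((s[0] & 0xF8) == 0xF0)
    { len = 4; cp = s[0] & 0x07; min = 0x10000; }
  else
    return 0;                   /* stray continuation or invalid byte */

  if (n < len)
    return 0;                   /* truncated sequence */

  for (i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;               /* not a continuation byte */
      cp = (cp << 6) | (s[i] & 0x3F);
    }

  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;

  *out = cp;
  return len;
}
```

Because the result is a full char32_t code point, characters outside
the BMP come out as a single value rather than as a surrogate pair,
which is what the 16-bit wchar_t of the Windows runtime cannot give
us.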
