[CCing bug-gnulib, because it's a question about Gnulib]

Eli Zaretskii wrote:
> I've tried to run the Info reader from Texinfo 7.2.90 after setting
> the console encoding to use UTF-8 (a.k.a. "codepage 65001"), showing
> the ELisp Reference manual (which uses UTF-8 to encode non-ASCII
> characters).  It didn't work as I expected: the UTF-8 sequences were
> shown as raw bytes instead of Unicode characters.
> 
> AFAICT, this happens because display.c:printed_representation doesn't
> recognize UTF-8 byte sequences as such, and instead decides that they
> are just raw bytes.  And that led me to this fragment:
> 
>   const char *
>   printed_representation (mbi_iterator_t *iter, int *delim, size_t pl_chars,
>                           int *pchars, int *pbytes)
>   {
>     struct text_buffer *rep = &printed_rep;
> 
>     char *cur_ptr = (char *) mbi_cur_ptr (*iter);
>     int cur_len = mb_len (mbi_cur (*iter));
> 
>     text_buffer_reset (&printed_rep);
> 
>     if (mb_isprint (mbi_cur (*iter)))
> 
> This uses multibyte iteration and functions/macros like mb_len and
> mb_isprint from Gnulib, and they evidently don't recognize UTF-8
> encoding in this case.  I didn't have time to look closely enough at
> the implementation (which is quite complex, and seems to use 32-bit
> Unicode codepoints and various functions that replace(?) the likes of
> mbrlen and mbrtowc), but it seems to still use mbstate_t as declared
> in the system headers.
> 
> So my question to Bruno is: do the above functions/macros rely on
> mbrlen and mbrtowc from the Windows C runtime, or do they replace them
> from the ground up?
> 
> I'm asking because AFAIK these functions as implemented in the legacy
> MSVCRT run-time library don't support UTF-8 encoding, only the newer
> UCRT runtime does.  And even UCRT only supports UTF-8 when the system
> locale is set to something.UTF-8; just setting the terminal's
> encoding to UTF-8 is not enough.  By contrast, I would like the Info
> reader to be capable of UTF-8 output when it runs on a Windows
> terminal whose encoding is UTF-8, and I'd like to be able to support
> this both in the MSVCRT and UCRT builds of the Info reader.  So if
> Gnulib replaces the above functions with its own (or can potentially
> replace them, given some build-time knobs), I'd like to try that
> and/or fix the code involved to pay attention to the terminal's
> encoding, not just to the current system locale, if that's possible.

You are right regarding the limited support of UTF-8 locales on native
Windows. And the problem is not limited to Windows; it stems from a basic
choice of programming APIs.

There are three ways to read files that contain multibyte characters:

  (A) Use locale-aware functions.
  (B) Always use a specific encoding (e.g. UTF-8), independently of the locale.
  (C) Support many encodings, independently of the locale.

Approach (A) consists of the function mbrtowc(), the macro MB_CUR_MAX, and
higher layers based on mbrtowc(): mbi_iterator_t etc.
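For concreteness, here is a minimal sketch of approach (A), using only the
standard mbrtowc(); Gnulib's mbi_iterator_t is a convenience layer over the
same locale-dependent primitive, so this illustrates the principle rather
than Texinfo's actual code (the sample string is made up):

  #include <locale.h>
  #include <stdio.h>
  #include <string.h>
  #include <wchar.h>

  int
  main (void)
  {
    setlocale (LC_ALL, "");        /* Decode according to the current locale.  */
    const char *s = "caf\xC3\xA9"; /* UTF-8 "café"; assumes a UTF-8 locale.  */
    const char *p = s;
    size_t left = strlen (s);
    mbstate_t state;

    memset (&state, 0, sizeof state);
    while (left > 0)
      {
        wchar_t wc;
        size_t n = mbrtowc (&wc, p, left, &state);
        if (n == (size_t) -1 || n == (size_t) -2)
          break;                   /* Invalid or incomplete sequence.  */
        if (n == 0)
          n = 1;                   /* An embedded NUL still consumes a byte.  */
        printf ("U+%04lX (%zu bytes)\n", (unsigned long) wc, n);
        p += n;
        left -= n;
      }
    return 0;
  }

On MSVCRT, where no UTF-8 locale exists, the mbrtowc() call above is the
point of failure, regardless of the terminal's encoding.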

Since locales are defined by the system (and many systems don't have a POSIX
compliant 'localedef' utility), this limits the available encodings:
  - On native Windows with MSVCRT, UTF-8 locales are not supported,
    with or without Gnulib: we can't support MB_CUR_MAX == 4 if the
    system supports only MB_CUR_MAX ≤ 2.
  - On macOS, musl libc, Android, and other platforms, unibyte locales are not
    supported because all locales use UTF-8 or ASCII.
  - On AIX, the system supports UTF-8 locales; but if your sysadmin has only
    installed ISO-8859-1 locales, you are doomed as well.

Approach (B) consists of using e.g. libunistring with the various u8_*
functions. This is independent of the locale, but it covers only a single,
ASCII-compatible encoding.
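For example (a sketch assuming libunistring or the corresponding Gnulib
modules are available; the helper name dump_utf8 is invented for
illustration):

  #include <stdio.h>
  #include <unistr.h>    /* u8_mbtouc */
  #include <unictype.h>  /* uc_is_print */

  /* Walk a buffer that is known to be UTF-8, regardless of the locale.  */
  static void
  dump_utf8 (const uint8_t *s, size_t n)
  {
    while (n > 0)
      {
        ucs4_t uc;
        int len = u8_mbtouc (&uc, s, n); /* >= 1; bad input yields U+FFFD.  */
        printf ("U+%04lX printable=%d\n", (unsigned long) uc,
                uc_is_print (uc) ? 1 : 0);
        s += len;
        n -= len;
      }
  }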

Approach (C) generalizes (B) by supporting several encodings. Up to 10 types
of encodings can be supported (unibyte, UTF-8, EUC, EUC-JP, EUC-TW, BIG5,
BIG5-HKSCS, GBK, GB18030, Shift_JIS).
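A generic way to approximate approach (C), sketched here with POSIX iconv()
under the assumption that the input's declared encoding is already known
(this is not how the gettext code mentioned below works), is to convert the
input to UTF-8 once and then reuse a UTF-8 code path like the one above:

  #include <iconv.h>
  #include <stdlib.h>
  #include <string.h>

  /* Convert IN (INLEN bytes, encoded in ENCODING) to a freshly allocated
     NUL-terminated UTF-8 string; return NULL on failure.  Error handling
     is deliberately minimal for this sketch.  */
  static char *
  to_utf8 (const char *encoding, const char *in, size_t inlen)
  {
    iconv_t cd = iconv_open ("UTF-8", encoding);
    if (cd == (iconv_t) -1)
      return NULL;                  /* Encoding not supported here.  */

    size_t outsize = 4 * inlen + 1; /* Generous upper bound.  */
    char *out = malloc (outsize);
    char *inp = (char *) in;        /* iconv() wants a non-const pointer.  */
    char *outp = out;
    size_t inleft = inlen, outleft = outsize - 1;

    if (out == NULL
        || iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
      {
        free (out);
        out = NULL;
      }
    else
      *outp = '\0';
    iconv_close (cd);
    return out;
  }

Converting up front keeps the display code single-encoding, at the cost of
an extra copy of the input.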

The code in texinfo/info/display.c uses approach (A), with the limitations
mentioned above.

If you want to overcome these limitations, the following questions need
to be answered first:
  - Which text encodings can occur in Info files?
  - Who decides about the text encoding in an Info file?
  - Are there commands for converting an Info file from one encoding to
    another (a kind of info-iconv)?

If you want to use approach (B), it's a different API, documented in the
GNU libunistring manual [1] and available through Gnulib modules [2].

If you want to use approach (C), such code exists in GNU gettext (files
po-charset.h, po-charset.c, read-po-internal.h, read-po-lex.c), but it would
need serious refactoring before it could be used outside GNU gettext.

Bruno

[1] https://www.gnu.org/software/libunistring/manual/html_node/index.html
[2] https://www.gnu.org/software/gnulib/manual/html_node/libunistring.html
