On Fri, Jan 16, 2026 at 10:19:54PM +0200, Eli Zaretskii wrote:
> > Can I clarify that "shown as raw bytes" means that they look like
> > "\302\251", i.e. as backslash escape sequences?
>
> Actually, even worse: some look like control characters, some (e.g.,
> \200) look like ASCII strings produced to represent non-printable
> characters, i.e., with actual ASCII backslash and 3 octal digits.
> That's because printed_representation uses the locale-aware functions
> from the C runtime, and the locale hasn't been changed to use UTF-8
> (and with the older Windows runtime MSVCRT it cannot be changed in
> principle, because MSVCRT didn't support UTF-8).
>
> What I wanted to accomplish was simple: have Info interpret the text
> as UTF-8, and output it as UTF-8. But because the C runtime functions
> like mbrlen and iswprint, which are called by mb_len and mb_isprint,
> don't recognize UTF-8, they return results which get in the way.

It seems like it would be simple to add code to pass through non-ASCII
bytes to the terminal:

diff --git a/info/display.c b/info/display.c
index 4df6a45063..34deae02ef 100644
--- a/info/display.c
+++ b/info/display.c
@@ -501,7 +501,7 @@ printed_representation (mbi_iterator_t *iter, int *delim, size_t pl_chars,
   text_buffer_reset (&printed_rep);
-  if (mb_isprint (mbi_cur (*iter)))
+  if (0 && mb_isprint (mbi_cur (*iter)))
     {
       /* cur.wc gives a wchar_t object. See mbiter.h in the
          gnulib/lib directory. */
@@ -575,6 +575,35 @@ printed_representation (mbi_iterator_t *iter, int *delim, size_t pl_chars,
     }
   else
     {
+      if (1)
+        {
+          unsigned char c = *cur_ptr;
+          if ((c & 0x80) == 0x00)
+            {
+              /* ASCII */
+              *pchars = 1;
+              *pbytes = 1;
+              ITER_SETBYTES (*iter, 1);
+              return cur_ptr;
+            }
+          if ((c & 0xc0) == 0x80)
+            {
+              /* UTF-8 continuation byte. */
+              *pchars = 0;
+              *pbytes = 1;
+              ITER_SETBYTES (*iter, 1);
+              return cur_ptr;
+            }
+          if ((c & 0xc0) == 0xc0)
+            {
+              /* UTF-8 initial byte. */
+              *pchars = 1;
+              *pbytes = 1;
+              ITER_SETBYTES (*iter, 1);
+              return cur_ptr;
+            }
+        }
+
       /* Original byte was not recognized as anything. Display its octal
          value. This could happen in the C locale for bytes above 128,
          or for bytes 128-159 in an ISO-8859-1 locale. Don't output the bytes

This counts every Unicode code point as 1 screen column, which is
correct for most text, though not for double-width (e.g. CJK) or
zero-width combining characters. It should make UTF-8 files display
mostly properly in the MS-Windows terminal that you are using.

We could add an Info variable to customize this behaviour.
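
To show the counting rule outside of Info, here is a minimal standalone
sketch. The names byte_columns and utf8_columns are hypothetical and do
not exist in info/display.c; the per-byte tests simply mirror the three
checks in the patch above. Control characters are not treated specially
here, since in Info they are handled before this code is reached.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Info.  Return the number of screen
   columns one byte is counted as, using the same tests as the patch. */
static int
byte_columns (unsigned char c)
{
  if ((c & 0x80) == 0x00)
    return 1;                   /* ASCII */
  if ((c & 0xc0) == 0x80)
    return 0;                   /* UTF-8 continuation byte */
  return 1;                     /* UTF-8 initial byte (top two bits set) */
}

/* Hypothetical helper.  Sum the column counts of the LEN bytes at S,
   so every code point contributes exactly one column. */
static size_t
utf8_columns (const char *s, size_t len)
{
  size_t cols = 0;
  size_t i;

  for (i = 0; i < len; i++)
    cols += byte_columns ((unsigned char) s[i]);
  return cols;
}
```

With this rule "\302\251" (U+00A9) is counted as 1 column. The sequence
"e\314\201" (e followed by combining U+0301) is counted as 2, although a
terminal would normally draw it in 1, which is the kind of case where
the approximation is off.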