Re: multibyte characters in the Info reader

Gavin Smith Wed, 21 Jan 2026 11:58:54 -0800

On Sat, Jan 17, 2026 at 09:28:30AM +0200, Eli Zaretskii wrote:
> If the above doesn't produce any problems (except with occasional wide
> characters), then it's an easy solution, I think.  And we could even
> do better if in the "UTF-8 initial byte" clause we compute the Unicode
> codepoint of the character and call wcwidth (which on Windows will
> call the Gnulib wcwidth and on other systems will DTRT since the above
> code should only be used when the locale's codeset is UTF-8).
> 
> > We could add an Info variable to customize this behaviour.
> 
> That'd be great, thanks.  Would it be possible to add that in this
> release?


Here's a more finished patch.  It would be fine to include this in the
next release if you can confirm that it works acceptably.

wcwidth takes a wchar_t argument and we can't guarantee the format of
this type.  Moreover, in info/pcterm.c, we redefine wcwidth as there
was a performance issue with calling the gnulib definition.  Reading
the UTF-8 sequence, obtaining the codepoint and calling wcwidth seems
to me to be an unnecessary complication for a marginal use case.
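
To make the simpler rule concrete, here is a minimal standalone sketch of
the column counting the patch below relies on: a UTF-8 continuation byte
takes no screen column and every other byte takes one.  It is not part of
the patch.  The name utf8_columns is hypothetical, and unlike the patch it
also counts ASCII bytes itself rather than leaving them to the existing
code path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch: count screen columns for
   a NUL-terminated UTF-8 string.  Continuation bytes (10xxxxxx) take no
   column; every other byte (ASCII or a UTF-8 initial byte) takes one.
   Double-width characters are therefore undercounted, and combining
   characters overcounted. */
static size_t
utf8_columns (const char *s)
{
  size_t columns = 0;

  assert (s != NULL);
  for (; *s; s++)
    {
      unsigned char c = (unsigned char) *s;
      if ((c & 0xc0) != 0x80)
        columns++;
    }
  return columns;
}
```

No codepoint is decoded and no wchar_t is involved, which is the point of
the approach; the cost is the occasional misaligned line where wide
characters appear.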


diff --git a/info/display.c b/info/display.c
index 4df6a45063..6c71bd9799 100644
--- a/info/display.c
+++ b/info/display.c
@@ -482,6 +482,8 @@ display_process_line (WINDOW *win,
 
 static struct text_buffer printed_rep = { 0 };
 
+int raw_utf8_output_p = 0;
+
 /* Return pointer to string that is the printed representation of character
    (or other logical unit) at ITER if it were printed at screen column
    PL_CHARS.  Use ITER_SETBYTES (util.h) on ITER if we need to advance
@@ -501,7 +503,38 @@ printed_representation (mbi_iterator_t *iter, int *delim, size_t pl_chars,
 
   text_buffer_reset (&printed_rep);
 
-  if (mb_isprint (mbi_cur (*iter)))
+  if (raw_utf8_output_p && (unsigned char) *cur_ptr >= 0x80)
+    {
+      /* For systems without a working UTF-8 locale but where UTF-8
+         actually works on the terminal.  This may happen in an MS-Windows
+         UTF-8 terminal with the MSVCRT run-time.
+
+         Pass through UTF-8 bytes to the terminal.  Count each character as
+         a single screen column.  This at least allows viewing (mostly
+         correctly) non-ASCII characters in UTF-8 Info files.
+
+         Searching, user entry etc. of non-ASCII characters may still
+         not work correctly. */
+
+      unsigned char c = *cur_ptr;
+      if ((c & 0xc0) == 0xc0)
+        {
+          /* UTF-8 initial byte. */
+          *pchars = 1;
+          *pbytes = 1;
+          ITER_SETBYTES (*iter, 1);
+          return cur_ptr;
+        }
+      if ((c & 0xc0) == 0x80)
+        {
+          /* UTF-8 continuation byte. */
+          *pchars = 0;
+          *pbytes = 1;
+          ITER_SETBYTES (*iter, 1);
+          return cur_ptr;
+        }
+    }
+  else if (mb_isprint (mbi_cur (*iter)))
     {
       /* cur.wc gives a wchar_t object.  See mbiter.h in the
          gnulib/lib directory. */
diff --git a/info/variables.c b/info/variables.c
index b6d4371de7..e91869ff57 100644
--- a/info/variables.c
+++ b/info/variables.c
@@ -164,6 +164,10 @@ VARIABLE_ALIST info_variables[] = {
       N_("How to print the information line at the start of a node"),
       CHOICES_VAR(nodeline_print, nodeline_choices) },
 
+  { "raw-utf8-output",
+      N_("Always pass through non-ASCII UTF-8 bytes in files to terminal"),
+      ON_OFF_VAR(raw_utf8_output_p) },
+
   { NULL }
 };
 
diff --git a/info/variables.h b/info/variables.h
index 5454ab942e..03d263c6a2 100644
--- a/info/variables.h
+++ b/info/variables.h
@@ -79,6 +79,7 @@ extern int key_time;
 extern int mouse_protocol;
 extern int follow_strategy;
 extern int nodeline_print;
+extern int raw_utf8_output_p;
 
 typedef struct {
     unsigned long mask;
