On Thu, Jan 15, 2026 at 10:33:58PM +0100, Bruno Haible via Bug reports for the 
GNU Texinfo documentation system wrote:
> Eli Zaretskii wrote:
> > Regarding the encoding of the Info file, it is not a serious problem,
> > because (a) most Info files use UTF-8 anyway, and (b) the Info reader
> > already includes support for re-encoding other codesets to UTF-8
> (provided that the Info reader is built with libiconv).  So the only
> > case where the encoding of the Info file is relevant is if the Info
> > reader was built without libiconv.
> 
> I see. So the problem is reduced to displaying
>   - (U) UTF-8 text in memory (most frequent case), or
>   - (L) locale-encoded text in memory (only if no iconv API available).

As I understand it, it is case (L) that is handled in info.

info attempts to recode Info files to the locale encoding (using the 
iconv function in libc).  It then uses locale-aware functions to process
the contents of Info files.  It does not make explicit use of UTF-8 in
many places.

The proposal to use "libiconv" appears to assume that the target encoding
is always "UTF-8", which would require a slight change in how info loads
Info files: it would recode them to UTF-8 always, rather than to the
locale encoding.  It wasn't clear to me from this discussion whether
people understood that info already uses the iconv function.

It would then require rewriting the whole program to use libunistring
instead of libc functions.  (Texinfo already uses a lot of libunistring
via gnulib in texi2any, although that part of the code is completely
separate from info and uses a separate gnulib checkout.)

Eli: what is missing from my understanding of your use case is what
is going on in scan.c:copy_converting, when the Info file is first
read in.  Does conversion of input files to UTF-8, based on the locale,
actually happen?

Can I clarify that "shown as raw bytes" means that they look like
"\302\251", i.e. as backslash escape sequences?

If the iteration over codepoints in printed_representation does not
work, not recognising non-ASCII UTF-8 sequences even though the terminal
supports them, then it would be better to fall back to ASCII substitutes
when the file is first read in.  This would not be ideal, but would
be better than getting the "\302\251" everywhere.  This would mean using
the degrade_utf8 function in scan.c.  Another possibility is using
the //TRANSLIT flag for an encoding passed to iconv (I didn't know about
this possibility when I wrote the ASCII degradation code, as it wasn't
documented in the libc manual or anywhere else I looked.)
