> From: Gavin Smith <[email protected]>
> Date: Fri, 16 Jan 2026 19:18:23 +0000
> Cc: Eli Zaretskii <[email protected]>, [email protected], [email protected]
>
> On Thu, Jan 15, 2026 at 10:33:58PM +0100, Bruno Haible via Bug reports for
> the GNU Texinfo documentation system wrote:
> > Eli Zaretskii wrote:
> > > Regarding the encoding of the Info file, it is not a serious problem,
> > > because (a) most Info files use UTF-8 anyway, and (b) the Info reader
> > > already includes support for re-encoding other codesets to UTF-8
> (provided that the Info reader is built with libiconv). So the only
> > > case where the encoding of the Info file is relevant is if the Info
> > > reader was built without libiconv.
> >
> > I see. So the problem is reduced to displaying
> > - (U) UTF-8 text in memory (most frequent case), or
> > - (L) locale-encoded text in memory (only if no iconv API available).
>
> As I understand, it is case (L) that is handled in info.
>
> info attempts to recode Info files to the locale encoding (using the
> iconv function in libc). It then uses locale-aware functions to process
> the contents of Info files. It does not make explicit use of UTF-8 in
> many places.
Yes.
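Schematically, the current scheme amounts to a single conversion on
input (an illustrative sketch with simplified names and no error
handling, not the actual scan.c code):

    #include <iconv.h>
    #include <langinfo.h>

    /* Current scheme, schematically: recode the Info file from its
       declared encoding into the locale's codeset once, on input.
       All later processing then uses locale-aware libc functions.  */
    iconv_t
    open_input_conversion (const char *file_encoding)
    {
      return iconv_open (nl_langinfo (CODESET), file_encoding);
    }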
> The proposal to use "libiconv" appears to assume that the target encoding
> is always "UTF-8", which would require a slight change in how info loads
> Info files: it would recode Info files to UTF-8 always, rather than
> to the locale encoding. It wasn't clear to me from this discussion whether
> people understood that info already uses the iconv function.
The proposal is to convert the file's text to UTF-8, process it in
UTF-8, then convert to the target encoding when outputting the text to
the screen.
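Schematically, that replaces the single conversion with two (again an
illustrative sketch, not existing code):

    #include <iconv.h>
    #include <langinfo.h>

    /* Proposed scheme: recode to UTF-8 on input, process everything
       in UTF-8 internally, and recode to the terminal's encoding
       only when writing to the screen.  Names are illustrative.  */
    iconv_t to_utf8;     /* used once, when the file is loaded    */
    iconv_t to_output;   /* used when text is sent to the screen  */

    void
    open_conversions (const char *file_encoding)
    {
      to_utf8   = iconv_open ("UTF-8", file_encoding);
      to_output = iconv_open (nl_langinfo (CODESET), "UTF-8");
    }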
> It would then require rewriting the whole program to use libunistring
> instead of libc functions.
Yes.
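To give an idea of what that entails, here's a sketch of what
UTF-8-aware replacements for wrappers such as mb_len and mb_isprint
(discussed below) might look like; the libunistring-based bodies are
hypothetical, not code that exists today:

    #include <stddef.h>
    #include <stdint.h>
    #include <unistr.h>      /* u8_mblen, u8_mbtouc (libunistring) */
    #include <unictype.h>    /* uc_is_print (libunistring) */

    /* Hypothetical UTF-8-aware counterparts of mb_len/mb_isprint,
       bypassing the C runtime's locale-dependent functions.  */
    int
    mb_len_utf8 (const char *s, size_t n)
    {
      return u8_mblen ((const uint8_t *) s, n);
    }

    int
    mb_isprint_utf8 (const char *s, size_t n)
    {
      ucs4_t uc;
      /* u8_mbtouc maps invalid sequences to U+FFFD instead of
         failing, so no separate error check is needed here.  */
      u8_mbtouc (&uc, (const uint8_t *) s, n);
      return uc_is_print (uc);
    }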
> Eli: what is missing from my understanding of your use case is what
> is going on in scan.c:copy_converting, when the Info file is first
> read in. Does conversion of input files to UTF-8, based on the locale,
> actually happen?
In the cases I tried, that conversion was not needed: the Info file
was already in UTF-8. (In fact, I have yet to see an Info file
encoded in some Windows codepage -- it just doesn't happen in my
experience, definitely not
with GNU projects. Either the files are pure ASCII or they are in
UTF-8.)
The current code of the Info reader, after the patches that I
submitted recently, reports "UTF-8" from nl_langinfo when the
terminal's encoding is UTF-8. And since the file is in UTF-8, the
iconv_to_output conversion is a no-op. But the character
classification and iteration in printed_representation are still done,
and they use the locale's encoding, because they eventually call
locale-aware functions from the C runtime.
> Can I clarify that "shown as raw bytes" means that they look like
> "\302\251", i.e. as backslash escape sequences?
Actually, even worse: some look like control characters, some (e.g.,
\200) look like ASCII strings produced to represent non-printable
characters, i.e., with an actual ASCII backslash and 3 octal digits.
That's because printed_representation uses the locale-aware functions
from the C runtime, and the locale hasn't been changed to use UTF-8
(and with the older Windows runtime MSVCRT it cannot be changed in
principle, because MSVCRT didn't support UTF-8).
What I wanted to accomplish was simple: have Info interpret the text
as UTF-8, and output it as UTF-8. But because the C runtime functions
like mbrlen and iswprint, which are called by mb_len and mb_isprint,
don't recognize UTF-8, they return results which get in the way.
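Here's a minimal demonstration of the difference (assumes
libunistring is installed; compile with -lunistring):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>       /* mbrlen */
    #include <unistr.h>      /* u8_mbtouc (libunistring) */
    #include <unictype.h>    /* uc_is_print (libunistring) */

    int
    main (void)
    {
      setlocale (LC_ALL, "");

      /* U+00A9 COPYRIGHT SIGN in UTF-8 -- the "\302\251" above.  */
      const char *s = "\302\251";
      mbstate_t st;
      memset (&st, 0, sizeof st);

      /* Locale-dependent: under a single-byte codeset this misparses
         the sequence as two separate characters, or fails.  */
      printf ("mbrlen: %ld\n", (long) mbrlen (s, strlen (s), &st));

      /* Locale-independent: libunistring always parses UTF-8.  */
      ucs4_t uc;
      int len = u8_mbtouc (&uc, (const uint8_t *) s, strlen (s));
      printf ("u8_mbtouc: %d bytes, U+%04X, printable: %d\n",
              len, (unsigned) uc, (int) uc_is_print (uc));
      return 0;
    }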
> If the iteration over codepoints in printed_representation does not
> work, not recognising non-ASCII UTF-8 sequences even though the terminal
> supports them, then it would be better to fall back to ASCII substitutes
> when the file is first read in. This would not be the best but would
be better than getting the "\302\251" everywhere. This would mean using
> the degrade_utf8 function in scan.c. Another possibility is using
> the //TRANSLIT flag for an encoding passed to iconv (I didn't know about
> this possibility when I wrote the ASCII degradation code, as it wasn't
> documented in the libc manual or anywhere else I looked.)
This already works, and worked in previous versions of Texinfo. If
the terminal's encoding is anything other than UTF-8, the Info reader
degrades the non-ASCII characters to their ASCII equivalents, whether
via //TRANSLIT or degrade_utf8. What I wanted was to allow Info to
output the original UTF-8-encoded characters to the Windows terminal
when the terminal's encoding is UTF-8, even if the locale's codeset is
different. And that will not work unless Info learns to use
UTF-8-aware functions from libunistring to handle multibyte characters
_instead_ of the C runtime functions.
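As an aside, the //TRANSLIT degradation mentioned above can be
exercised in isolation like this (note that //TRANSLIT is a GNU
extension supported by glibc and GNU libiconv; other iconv
implementations may reject it):

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int
    main (void)
    {
      iconv_t cd = iconv_open ("ASCII//TRANSLIT", "UTF-8");
      if (cd == (iconv_t) -1)
        { perror ("iconv_open"); return 1; }

      char in[] = "\302\251";            /* U+00A9 COPYRIGHT SIGN */
      char out[16];
      char *inp = in, *outp = out;
      size_t inleft = strlen (in), outleft = sizeof out - 1;
      if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        perror ("iconv");
      *outp = '\0';
      printf ("%s\n", out);              /* e.g. "(C)" with glibc */
      iconv_close (cd);
      return 0;
    }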
So, to summarize:
. there's no regression in the Info reader wrt Texinfo 7.2
. I would like to improve Info to support UTF-8 when possible, but
that requires changes in how Info handles non-ASCII Info files
I hope this clarifies my position.