On Sat, Apr 18, 2026 at 09:03:28PM +0100, Gavin Smith wrote:
> I propose solving this in a more limited way: allow texindex, if possible,
> to operate in a UTF-8 mode, regardless of the system locale encoding.
> texi2dvi could direct texindex to operate in this mode.
>
> Arnold Robbins, who wrote the current texindex program, suggested an
> option to texindex to operate in such a mode several years ago, although
> as far as I know this never progressed and we never found out if it would
> really be possible (or if it would cause issues for awk programs that weren't
> gawk).

It doesn't seem like it is easy to achieve.  gawk works completely within
the rules of the current locale, using locale-dependent C functions for
everything.

Here are the ways forward that I'm aware of:

* Change the accented characters in the Texinfo source to use accent
commands, e.g. change é to @'e.  Then this will be sorted under "E".
(From https://lists.gnu.org/archive/html/bug-texinfo/2022-11/msg00023.html.)
I know this is not the best solution for users but it may help in the
meantime.

* Complete rewrite of texindex in a different language.  I'm not keen on
this idea at all.  It would be hard to beat the simplicity of the current
awk implementation, which is a self-contained portable awk program of a
few hundred lines.  Whether it was rewritten in C or Perl with more
features, it would likely be longer, harder to read and have many external
dependencies.

* Try to look for a UTF-8 locale in texi2dvi (using the output from
"locale -a") if we detect the Texinfo file is in UTF-8.  This would not
actually work for mawk, which is an awk implementation which only works
with bytes.  According to Arnold, the locale may not make any difference
for BWK awk either.

* Here's my current preferred solution, which should work with any awk
(gawk or mawk) regardless of the locale setting, as well as with XeTeX
and LuaTeX (which Werner Lemberg reported problems with in 2022):
In texinfo.tex, output multibyte UTF-8 sequences with braces around them
in the sort key.  This works because texindex preserves braced units.

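As a rough illustration of that last point (a sketch only, not texindex's
actual code; first_unit is a hypothetical helper name), an awk program can
pick out the leading unit of a sort key without knowing anything about
UTF-8, because the braces that delimit it are plain ASCII:

```shell
#!/bin/sh
# Hypothetical helper, not part of texindex: print the first "unit" of each
# sort key read from stdin.  A leading {...} group (nesting allowed) counts
# as one unit; otherwise the unit is the first character.
first_unit () {
  LC_ALL=C awk '
    {
      if (substr($0, 1, 1) != "{") { print substr($0, 1, 1); next }
      depth = 0
      for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c == "{") depth++
        else if (c == "}" && --depth == 0) break   # matching close brace
      }
      print substr($0, 1, i)
    }'
}

printf '%s\n' '{é}crire des lettres' 'bbbb' | first_unit
```

This prints {é} and then b.  The scan only ever compares against "{" and
"}", so the result is the same whether the awk in use counts bytes or
characters.
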
$ cat test.texi
\input texinfo
@cindex à gré, césure
@cindex écrire des lettres
@cindex bbbb
Index:
@printindex cp
@bye
$ cat test.cp
@entry{{à} gr{é}, c{é}sure}{1}{à gré, césure}
@entry{{é}crire des lettres}{1}{écrire des lettres}
@entry{bbbb}{1}{bbbb}
$ LC_ALL=C texindex test.cp
$ cat test.cps
@initial {{à}}
@entry{à gré, césure}{1}
@initial {{é}}
@entry{écrire des lettres}{1}
@initial {B}
@entry{bbbb}{1}
$

Possibly texinfo.tex could be further modified to uppercase é to É (or E
if more appropriate).  This should be possible in theory as we provide
explicit definitions for all the Unicode characters we support.

(We've got no control over the collation order though - this is a
fundamental limitation, but a minor one, in my opinion.)

I've made a start on working on this last idea.  Here's my current patch
to texinfo.tex.  I will need to do more work on this before committing
anything.

diff --git a/doc/texinfo.tex b/doc/texinfo.tex
index d429e32031..5aa9996951 100644
--- a/doc/texinfo.tex
+++ b/doc/texinfo.tex
@@ -5437,6 +5437,7 @@ $$%
   \extractindexcommands\segment
   \ifx\indexsortkey\empty{%
     \indexnonalnumdisappear
+    \inindexsortkeytrue
     \xdef\trimmed{\segment}%
     \xdef\trimmed{\expandafter\eatspaces\expandafter{\trimmed}}%
     \xdef\indexsortkey{\trimmed}%
@@ -10711,6 +10712,14 @@ directory should work if nowhere else does.}
 \newif\ifutfviiidefinedwarning
 \utfviiidefinedwarningtrue
 
+\gdef\UTFviiiBracedTwo#1#2{{\string #1\string #2}}
+\gdef\UTFviiiBracedThree#1#2#3{{\string #1\string #2\string #3}}
+\gdef\UTFviiiBracedFour#1#2#3#4{{\string #1\string #2\string #3\string #4}}
+
+% We use this with the \ifinindexsortkey condition to expand and discard
+% an \else block in the containing conditional.
+\def\swapnestedfi#1\fi{\fi\expandafter#1\expandafter}
+
 % Give non-ASCII bytes the active definitions for processing UTF-8 sequences
 \begingroup
   \catcode`\~13
@@ -10742,7 +10751,9 @@ directory should work if nowhere else does.}
   \countUTFy = "E0
   \def\UTFviiiTmp{%
     \gdef~{%
-      \ifpassthroughchars $%
+      \ifpassthroughchars
+        \ifinindexsortkey\swapnestedfi\UTFviiiBracedTwo\fi
+        $%
       \else\expandafter\UTFviiiTwoOctets\expandafter$\fi}}%
   \UTFviiiLoop
 
@@ -10750,7 +10761,9 @@ directory should work if nowhere else does.}
   \countUTFy = "F0
   \def\UTFviiiTmp{%
     \gdef~{%
-      \ifpassthroughchars $%
+      \ifpassthroughchars
+        \ifinindexsortkey\swapnestedfi\UTFviiiBracedThree\fi
+        $%
       \else\expandafter\UTFviiiThreeOctets\expandafter$\fi}}%
   \UTFviiiLoop
 
@@ -10758,7 +10771,9 @@ directory should work if nowhere else does.}
   \countUTFy = "F4
   \def\UTFviiiTmp{%
     \gdef~{%
-      \ifpassthroughchars $%
+      \ifpassthroughchars
+        \ifinindexsortkey\swapnestedfi\UTFviiiBracedFour\fi
+        $%
       \else\expandafter\UTFviiiFourOctets\expandafter$\fi
       }}%
   \UTFviiiLoop
@@ -11757,6 +11772,9 @@ directory should work if nowhere else does.}
 \newif\ifpassthroughchars
 \passthroughcharsfalse
 
+\newif\ifinindexsortkey
+\inindexsortkeyfalse
+
 % For native Unicode handling (XeTeX and LuaTeX),
 % provide a definition macro to replace/pass-through a Unicode character
 %
@@ -11768,7 +11786,11 @@ directory should work if nowhere else does.}
       \uccode`\~="##2\relax
       \uppercase{\gdef~}{%
         \ifpassthroughchars
-          ##1%
+          \ifinindexsortkey
+            {##1}%
+          \else
+            ##1%
+          \fi
         \else
           ##3%
         \fi

>
> Here was an earlier thread on it:
>
> https://lists.gnu.org/archive/html/bug-texinfo/2022-11/msg00008.html
> From: Werner LEMBERG
> Subject: `texindex` output depends on locale settings
> Date: Sun, 06 Nov 2022 10:02:44 +0000 (UTC)
>
> I will try to ask Arnold if he has any ideas how to proceed.
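
For completeness, the locale lookup from the third option in the list
above could be sketched roughly as follows.  This is only an illustration
under my own assumptions: pick_utf8_locale is a hypothetical helper, not
anything that exists in texi2dvi, and as noted it would not help with an
awk such as mawk that only works with bytes.

```shell
#!/bin/sh
# Hypothetical helper (not part of texi2dvi): read locale names on stdin,
# one per line as printed by "locale -a", and print the first one whose
# codeset looks like UTF-8 (e.g. en_US.utf8 or C.UTF-8).  Prints nothing
# if there is no such locale.
pick_utf8_locale () {
  grep -i -E '\.utf-?8' | head -n 1
}

# Canned input so the result does not depend on the host system;
# the intended use would be:  locale -a | pick_utf8_locale
printf '%s\n' C POSIX en_US.utf8 fr_FR.UTF-8 | pick_utf8_locale
```

With the canned list this prints en_US.utf8.  A caller could then run
texindex with LC_ALL set to the name found, and leave the locale alone
when nothing is printed.
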
