On Sat, Apr 18, 2026 at 09:03:28PM +0100, Gavin Smith wrote:
> I propose solving this in a more limited way: allow texindex, if possible,
> to operate in a UTF-8 mode, regardless of the system locale encoding.
> texi2dvi could direct texindex to operate in this mode.
> 
> Arnold Robbins, who wrote the current texindex program, suggested an
> option to texindex to operate in such a mode several years ago, although
> as far as I know this never progressed and we never found out if it would
> really be possible (or if it would cause issues for awk programs that weren't
> gawk).

This doesn't seem easy to achieve.  gawk works completely within
the rules of the current locale, using locale-dependent C functions for
everything.

Here are the ways forward that I'm aware of:

* Change the accented characters in the Texinfo source to use accent commands,
  e.g. change é to @'e.  Then this will be sorted under "E".  (From
  https://lists.gnu.org/archive/html/bug-texinfo/2022-11/msg00023.html.)  I
  know this is not the best solution for users but it may help in the meantime.

* Complete rewrite of texindex in a different language.  I'm not keen on this
  idea at all.  It would be hard to beat the simplicity of the current awk
  implementation, which is a self-contained portable awk program of a few
  hundred lines.

  Whether it was rewritten in C or Perl with more features, it would
  likely be longer and harder to read, and would have many external
  dependencies.

* Try to look for a UTF-8 locale in texi2dvi (using the output from "locale -a")
  if we detect the Texinfo file is in UTF-8.  This would not actually work
  for mawk, which is an awk implementation that only works with bytes.
  According to Arnold, the locale may not make any difference for BWK
  awk either.

* Here's my current preferred solution, which should work with any awk (gawk
  or mawk) regardless of the locale setting, as well as with XeTeX and
  LuaTeX (which Werner Lemberg reported problems with in 2022):

  In texinfo.tex, output multibyte UTF-8 sequences with braces around
  them in the sort key.

  This works because texindex preserves braced units.

$ cat test.texi
\input texinfo

@cindex à gré, césure
@cindex écrire des lettres
@cindex bbbb


Index: 
@printindex cp

@bye
$ cat test.cp
@entry{{à} gr{é}, c{é}sure}{1}{à gré, césure}
@entry{{é}crire des lettres}{1}{écrire des lettres}
@entry{bbbb}{1}{bbbb}
$ LC_ALL=C texindex test.cp
$ cat test.cps
@initial {{à}}
@entry{à gré, césure}{1}
@initial {{é}}
@entry{écrire des lettres}{1}
@initial {B}
@entry{bbbb}{1}
$ 

  Possibly texinfo.tex could be further modified to uppercase é to É
  (or E if more appropriate).  This should be possible in theory as
  we provide explicit definitions for all the Unicode characters we
  support.

  (We've got no control over the collation order though - this is a
  fundamental limitation, but a minor one, in my opinion.)
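
To make the intended sort key format concrete, here is a minimal sketch in
Python of the transformation the sort key should undergo.  It is an
illustration only and not part of the patch: the function name
brace_multibyte is made up, and it assumes the input uses precomposed
characters, so that each accented letter is a single multibyte UTF-8
sequence.

```python
def brace_multibyte(key: str) -> str:
    """Wrap each non-ASCII character in braces, leaving ASCII untouched.

    Hypothetical helper for illustration; the real work is done by
    macros in texinfo.tex, not by any Python code.
    """
    out = []
    for ch in key:
        # Characters above U+007F are encoded as two to four bytes in UTF-8.
        if ord(ch) > 0x7F:
            out.append("{" + ch + "}")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for entry in ("à gré, césure", "écrire des lettres", "bbbb"):
        print(brace_multibyte(entry))
```

For the three entries in test.texi this should give the same sort keys as
the first argument of each @entry line in test.cp above.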


I've made a start on this last idea.  Here's my current
patch to texinfo.tex.  I will need to do more work on this before
committing anything.


diff --git a/doc/texinfo.tex b/doc/texinfo.tex
index d429e32031..5aa9996951 100644
--- a/doc/texinfo.tex
+++ b/doc/texinfo.tex
@@ -5437,6 +5437,7 @@ $$%
       \extractindexcommands\segment
       \ifx\indexsortkey\empty{%
         \indexnonalnumdisappear
+        \inindexsortkeytrue
         \xdef\trimmed{\segment}%
         \xdef\trimmed{\expandafter\eatspaces\expandafter{\trimmed}}%
         \xdef\indexsortkey{\trimmed}%
@@ -10711,6 +10712,14 @@ directory should work if nowhere else does.}
 \newif\ifutfviiidefinedwarning
 \utfviiidefinedwarningtrue
 
+\gdef\UTFviiiBracedTwo#1#2{{\string #1\string #2}}
+\gdef\UTFviiiBracedThree#1#2#3{{\string #1\string #2\string #3}}
+\gdef\UTFviiiBracedFour#1#2#3#4{{\string #1\string #2\string #3\string #4}}
+
+% We use this with the \ifinindexsortkey condition to expand and discard
+% an \else block in the containing conditional.
+\def\swapnestedfi#1\fi{\fi\expandafter#1\expandafter}
+
 % Give non-ASCII bytes the active definitions for processing UTF-8 sequences
 \begingroup
   \catcode`\~13
@@ -10742,7 +10751,9 @@ directory should work if nowhere else does.}
   \countUTFy = "E0
   \def\UTFviiiTmp{%
     \gdef~{%
-        \ifpassthroughchars $%
+        \ifpassthroughchars
+          \ifinindexsortkey\swapnestedfi\UTFviiiBracedTwo\fi
+          $%
         \else\expandafter\UTFviiiTwoOctets\expandafter$\fi}}%
   \UTFviiiLoop
 
@@ -10750,7 +10761,9 @@ directory should work if nowhere else does.}
   \countUTFy = "F0
   \def\UTFviiiTmp{%
     \gdef~{%
-        \ifpassthroughchars $%
+        \ifpassthroughchars
+          \ifinindexsortkey\swapnestedfi\UTFviiiBracedThree\fi
+          $%
         \else\expandafter\UTFviiiThreeOctets\expandafter$\fi}}%
   \UTFviiiLoop
 
@@ -10758,7 +10771,9 @@ directory should work if nowhere else does.}
   \countUTFy = "F4
   \def\UTFviiiTmp{%
     \gdef~{%
-        \ifpassthroughchars $%
+        \ifpassthroughchars
+          \ifinindexsortkey\swapnestedfi\UTFviiiBracedFour\fi
+          $%
         \else\expandafter\UTFviiiFourOctets\expandafter$\fi
         }}%
   \UTFviiiLoop
@@ -11757,6 +11772,9 @@ directory should work if nowhere else does.}
 \newif\ifpassthroughchars
 \passthroughcharsfalse
 
+\newif\ifinindexsortkey
+\inindexsortkeyfalse
+
 % For native Unicode handling (XeTeX and LuaTeX),
 % provide a definition macro to replace/pass-through a Unicode character
 %
@@ -11768,7 +11786,11 @@ directory should work if nowhere else does.}
         \uccode`\~="##2\relax
         \uppercase{\gdef~}{%
           \ifpassthroughchars
-            ##1%
+            \ifinindexsortkey
+              {##1}%
+            \else
+              ##1%
+            \fi
           \else
             ##3%
           \fi


> 
> Here was an earlier thread on it:
> 
> https://lists.gnu.org/archive/html/bug-texinfo/2022-11/msg00008.html
> From: Werner LEMBERG
> Subject: `texindex` output depends on locale settings
> Date: Sun, 06 Nov 2022 10:02:44 +0000 (UTC)
> 
> I will try to ask Arnold if he has any ideas how to proceed.
> 
> 
