On Mon, May 25, 2026 at 09:56:35PM +0200, Patrice Dumas wrote:
> On Mon, May 25, 2026 at 07:27:09PM +0100, Gavin Smith wrote:
> > I still don't know what is supposed to make two strings sort in a
> > predictable
> > order if they differ only by case. I checked that the sort keys are
> > identical
> > in that case. Patrice, do you remember anything?
>
> The sorting does not only use the sort key, but also the number of the
> index entry in index and the index names sort order (needed for merged
> indices), as seen in Perl in Indices.pm _sort_index_entries. Therefore,
> the order should be predictable even if the upper and lower case letters
> have the same sort key.
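A minimal sketch of the tie-breaking described above, in Python rather than the Perl of Indices.pm. The names `IndexEntry`, `comparison_tuple` and `printed_order` are hypothetical. The order of the tie-breakers (entry number before index name) is an assumption, and the uppercased text is only a stand-in for a real collation key.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    text: str          # entry text as written in the source
    sort_key: str      # stand-in for the collation key; equal for case variants
    entry_number: int  # position of the entry within its index (source order)
    index_name: str    # index the entry came from, relevant for merged indices


def comparison_tuple(entry: IndexEntry):
    # Assumed tie-breaker order: sort key, then entry number, then index name.
    return (entry.sort_key, entry.entry_number, entry.index_name)


def printed_order(texts):
    # Hypothetical helper: number the entries in source order and use the
    # uppercased text as the sort key, so that "Ł" and "ł" tie on the key.
    entries = [IndexEntry(t, t.upper(), n, "cp") for n, t in enumerate(texts, 1)]
    return [e.text for e in sorted(entries, key=comparison_tuple)]


first = printed_order(["aa", "zz", "Ł", "ł"])
second = printed_order(["aa", "zz", "ł", "Ł"])
```

With these assumptions the result is deterministic, but entries that tie on the sort key come out in source order, so `first` ends with Ł then ł while `second` ends with ł then Ł. The comparison here is by code point, so Ł lands after zz, unlike with Unicode::Collate.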
I understand now why the index entries are output in a predictable order.
However, this means that changing the order of the index entry commands
in the source changes the order in the printed index, if entries differ
only by letter case. For example:
$ cat test.texi ; TEXINFO_OUTPUT_FORMAT=plaintext texi2any test.texi
\input texinfo
@node Top
@top
@cindex aa
@cindex zz
@cindex @L{}
@cindex @l{}
@printindex cp
@bye
* Menu:
* aa: Top. (line 0)
* Ł: Top. (line 0)
* ł: Top. (line 0)
* zz: Top. (line 0)
$ cat test.texi ; TEXINFO_OUTPUT_FORMAT=plaintext texi2any test.texi
\input texinfo
@node Top
@top
@cindex aa
@cindex zz
@cindex @l{}
@cindex @L{}
@printindex cp
@bye
* Menu:
* aa: Top. (line 0)
* ł: Top. (line 0)
* Ł: Top. (line 0)
* zz: Top. (line 0)
There is also
>
> That being said, I do not know exactly why the strings are upper-cased
> before being sorted. Maybe this is relevant if there is no
> Unicode::Collate sorting (presumably, the lowercase/uppercase sorting is
> done well with Unicode::Collate), as it allows the upper-case and lower
> case letter to be nearby in sort in that case.
Yes, exactly, although it wouldn't make upper case and lower case variants
sort in a consistent order. There may be ways to make that happen using
strcmp comparison, something like:
sort key = uppercase(index entry) . '\x01' . index entry
- i.e., concatenate the uppercased index entry with the original index
entry, with a low-valued byte in between. But it is not that important.
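A rough sketch of that idea in Python, under stated assumptions: `case_stable_sort_key` is a hypothetical name, Python's code point comparison stands in for strcmp on UTF-8 bytes, and entries are assumed not to contain characters at or below U+0001.

```python
def case_stable_sort_key(entry: str) -> str:
    # Uppercased text first, so that case variants stay next to each other;
    # then a low-valued separator, so that a shorter entry sorts before a
    # longer one sharing its prefix; then the original text as a tie-breaker.
    return entry.upper() + "\x01" + entry


# The two source orders from the test files above.
order_a = ["aa", "zz", "Ł", "ł"]
order_b = ["aa", "zz", "ł", "Ł"]

sorted_a = sorted(order_a, key=case_stable_sort_key)
sorted_b = sorted(order_b, key=case_stable_sort_key)
```

Both inputs now sort to the same sequence, with Ł before ł because U+0141 is lower than U+0142. This is not collation: with plain code point comparison, Ł still sorts after zz.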