On Mon, May 25, 2026 at 10:52:29PM +0200, Patrice Dumas wrote:
> > > That being said, I do not know exactly why the strings are upper-cased
> > > before being sorted.  Maybe this is relevant if there is no
> > > Unicode::Collate sorting (presumably, the lowercase/uppercase sorting is
> > > done well with Unicode::Collate), as it allows the upper-case and lower
> > > case letter to be nearby in sort in that case.
> > 

> ...

> Anyway, should this be added in the TODO?  Or do we consider that it is
> ok and then I can simply add a comment in the code?

I think that we should not uppercase the index entries before getting the
collation key, except in the case of USE_UNICODE_COLLATION=0.  There are
three reasons, in order of importance: it raises the question for anyone
reading the code of why it is done; it may be locale-dependent (uppercasing
is known to be potentially locale-dependent); and it may have a performance
cost.  It's confusing because getting case distinctions right is a major
part of what the collation algorithm does.

When using USE_UNICODE_COLLATION=0, it is fine to carry on doing it the way
we are currently doing it - uppercase before getting the sort keys.  There
is no need for any extra complications to always get upper and lower case in
exactly the right order.
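
To illustrate the uppercase-before-sorting approach, here is a minimal
sketch in Python rather than the actual Texinfo Perl code.  The entry
strings and variable names are made up for the example.  It shows that case
variants end up next to each other, but that their relative order just
follows the input order, since their sort keys are equal:

```python
# Hypothetical index entries; not taken from any real manual.
entries_a = ["banana", "apple", "Apple"]
entries_b = ["banana", "Apple", "apple"]

# Plain code point comparison: all uppercase letters sort before
# all lowercase letters.
plain = sorted(entries_a)

# Uppercase before comparing: "apple" and "Apple" get the same key, so
# Python's stable sort leaves them in input order.
upper_a = sorted(entries_a, key=str.upper)
upper_b = sorted(entries_b, key=str.upper)

print(plain)    # ['Apple', 'apple', 'banana']
print(upper_a)  # ['apple', 'Apple', 'banana']
print(upper_b)  # ['Apple', 'apple', 'banana']
```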

> > Yes, exactly, although it wouldn't make upper case and lower case variants
> > sort in a consistent order.  There may be ways to make that happen using
> > strcmp comparison: something like:
> > 
> > sort key = uppercase(index entry) . '\x01' . index entry
> > 
> > - i.e., concatenate the uppercased index entry with the original index
> > entry, with a low valued byte in between.  But it is not that important.
> 
> Is the '\x01' really needed?
> 

That's what the Unicode Collation Algorithm does to get a multi-level
sort.

Suppose uppercase letters sort before lowercase letters.

Then given the three entries to sort:

aa
aaZ
bb

They should sort in that order.

However, using:

sort key = uppercase(index entry) . index entry

would give the sort keys

AAaa
AAZaaZ
BBbb

As Z sorts before a, this sorts to:

AAZaaZ
AAaa
BBbb

so the entries would sort in the order

aaZ
aa
bb

which would be incorrect.

If you put a low valued separator in, like \1, the keys are:

AA\1aa
AAZ\1aaZ
BB\1bb

Now when comparing "AA\1aa" and "AAZ\1aaZ", these are identical up until
the third character, where "\1" sorts earlier than "Z".  So "aa" correctly
sorts before "aaZ".
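
The same comparison can be checked with a short script.  This is only a
sketch in Python, not the Texinfo code: the function names are invented for
the example, and it assumes plain code point comparison (as with strcmp on
ASCII) and that index entries never contain the \x01 character themselves.

```python
def key_without_separator(entry):
    # uppercase(index entry) . index entry
    return entry.upper() + entry

def key_with_separator(entry):
    # uppercase(index entry) . '\x01' . index entry
    return entry.upper() + "\x01" + entry

entries = ["bb", "aaZ", "aa"]

sorted_without = sorted(entries, key=key_without_separator)
sorted_with = sorted(entries, key=key_with_separator)

print(sorted_without)  # ['aaZ', 'aa', 'bb']  (incorrect order)
print(sorted_with)     # ['aa', 'aaZ', 'bb']  (correct order)
```

The separator works because it compares lower than any character that can
appear in the uppercased part, so a shorter uppercased prefix always wins
before the second-level (original case) part is looked at.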

It's more difficult than you might think.
