On Mon, May 25, 2026 at 10:52:29PM +0200, Patrice Dumas wrote:
> > > That being said, I do not know exactly why the strings are upper-cased
> > > before being sorted. Maybe this is relevant if there is no
> > > Unicode::Collate sorting (presumably, the lowercase/uppercase sorting is
> > > done well with Unicode::Collate), as it allows the upper-case and lower
> > > case letter to be nearby in sort in that case.
> >
> ...
> Anyway, should this be added in the TODO? Or do we consider that it is
> ok and then I can simply add a comment in the code?

I think that we should not uppercase the index entries before getting
the collation key, except in the case of USE_UNICODE_COLLATION=0. The
reasons, in order of importance: it raises the question, for anyone
reading the code, of why it is done; it may be locale-dependent
(uppercasing is known to be potentially locale-dependent); and it may
have a performance cost. It is confusing because getting case
distinctions right is a major part of what the collation algorithm does.

When using USE_UNICODE_COLLATION=0, it is fine to carry on doing it the
way we are currently doing it - uppercase before getting the sort keys.
There is no need for any extra complications to always get upper and
lower case in exactly the right order.

> > Yes, exactly, although it wouldn't make upper case and lower case variants
> > sort in a consistent order. There may be ways to make that happen using
> > strcmp comparison: something like:
> >
> > sort key = uppercase(index entry) . '\x01' . index entry
> >
> > - i.e., concatenate the uppercased index entry with the original index
> > entry, with a low valued byte in between. But it is not that important.
>
> Is the '\x01' really needed?
>

That's what the Unicode Collation Algorithm does to get a multi-level
sort.

Suppose uppercase letters sort before lowercase letters. Then given
the three entries to sort:

aa
aaZ
bb

They should sort in that order. However, using:

sort key = uppercase(index entry) . index entry

would give the sort keys

AAaa
AAZaaZ
BBbb

As Z sorts before a, this sorts to:

AAZaaZ
AAaa
BBbb

so the entries would sort in the order

aaZ
aa
bb

which would be incorrect.

If you put a low-valued separator in, like \1, the keys are:

AA\1aa
AAZ\1aaZ
BB\1bb

Now when comparing "AA\1aa" and "AAZ\1aaZ", the first two characters
are identical and the keys differ at the third character, where "\1"
sorts earlier than "Z". So "aa" correctly sorts before "aaZ".

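For anyone who wants to check the example by hand, here is a minimal
sketch in Python (not the texi2any code, which is Perl). The function
names are made up for illustration, and plain string comparison on
ASCII-only entries is assumed to behave like strcmp.

```python
# Sketch only: compares the two sort-key schemes from the example.
# Assumes ASCII entries, so Python's code-point comparison of strings
# gives the same order as a byte-wise strcmp.

def naive_key(entry: str) -> str:
    """Uppercased entry followed directly by the original entry."""
    return entry.upper() + entry


def separated_key(entry: str, separator: str = "\x01") -> str:
    """Uppercased entry, a low-valued separator, then the original entry."""
    return entry.upper() + separator + entry


entries = ["bb", "aaZ", "aa"]

naive_order = sorted(entries, key=naive_key)
separated_order = sorted(entries, key=separated_key)

print(naive_order)      # ['aaZ', 'aa', 'bb']  (wrong: "aaZ" before "aa")
print(separated_order)  # ['aa', 'aaZ', 'bb']  (expected order)

# Case variants of the same word also come out in one fixed order,
# uppercase first, because the second level breaks the tie.
print(sorted(["aa", "AA", "Aa"], key=separated_key))  # ['AA', 'Aa', 'aa']
```

The separator only has to compare lower than any character that can
appear in an uppercased entry, which is why a byte like \1 works here.
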
It's more difficult than you might think.
