On Mon, May 25, 2026 at 06:07:14PM +0100, Gavin Smith wrote:
> 'to_upper_or_lower_multibyte' calls u8_toupper on the index entry text, which
> presumably converts it to upper case.
> 
> Likewise, the Perl code in Texinfo/Indices.pm uses 'uc' to uppercase the
> string before getting the sort key:
> 
>         my $sort_key = $collator->getSortKey(uc($sort_string));
> 
> That would explain why Ł and ł were sometimes in the wrong order, as the
> entries would be indistinguishable if uppercased.  However, why this problem
> is triggered now, and only with Polish ł, is still a mystery.  The 'uc'
> has been there in the Perl code for at least two years (I didn't trace
> back the git history any more).
> 
> Uppercasing the string before getting the sort key should NOT be necessary
> for most of the sorting options.  (It might be useful when not getting
> a sort key at all, when USE_UNICODE_COLLATION is 0.)

I found a bug in the returned length of the sort key, which is very
likely responsible.  The length could be too long, leading to uninitialised
bytes after the terminating null being included in the sort key.  This would
explain why the test failure was hard to reproduce, as it depended on
the contents of uninitialised memory.
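To illustrate the failure mode, here is a minimal sketch (not the actual
texi2any code).  The 'sort_key' struct, 'make_key' and 'compare_keys' are
hypothetical names, and the 'trailing' argument simulates whatever happened to
be in memory after the terminating null, so that the effect is deterministic.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sort key record: the key bytes plus the length that the
   comparison will trust. */
typedef struct {
  unsigned char *bytes;
  size_t len;
} sort_key;

/* Compare two keys bytewise over their recorded lengths. */
static int
compare_keys (const sort_key *a, const sort_key *b)
{
  size_t min = a->len < b->len ? a->len : b->len;
  int res = memcmp (a->bytes, b->bytes, min);
  if (res != 0)
    return res;
  return (a->len > b->len) - (a->len < b->len);
}

/* Build a key in an oversized buffer.  The buffer is first filled with
   TRAILING to stand in for uninitialised memory, then TEXT is copied in
   together with its terminating null.  REPORTED_LEN is the length handed
   back to the caller, which may or may not be correct. */
static sort_key
make_key (const char *text, size_t reported_len, unsigned char trailing)
{
  enum { CAPACITY = 16 };
  sort_key k;

  assert (strlen (text) < CAPACITY && reported_len <= CAPACITY);
  k.bytes = malloc (CAPACITY);
  if (!k.bytes)
    abort ();
  memset (k.bytes, trailing, CAPACITY);
  strcpy ((char *) k.bytes, text);
  k.len = reported_len;
  return k;
}
```

With a reported length of 8 for the three-byte key "ABC", two keys built over
different leftover bytes compare as unequal, even though the real key bytes
are the same.  With the length set to strlen ("ABC") they compare as equal.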

I fixed this in texi2any in commit 54592bff0a (today).

I still don't know what is supposed to make two strings sort in a predictable
order if they differ only by case.  I checked that the sort keys are identical
in that case.  Patrice, do you remember anything?
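For reference, one common way to get a predictable order when sort keys are
identical is to fall back to comparing the original strings.  The sketch below
only illustrates that idea; it is an assumption, not what texi2any currently
does, and 'index_entry' and 'compare_entries' are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical index entry: the original text plus its sort key. */
typedef struct {
  const char *text;          /* original entry text, before uppercasing */
  const unsigned char *key;  /* sort key bytes */
  size_t key_len;            /* number of valid bytes in KEY */
} index_entry;

/* qsort comparator: order by sort key first, then break ties on the
   original text so that entries differing only by case always come out
   in the same order. */
static int
compare_entries (const void *pa, const void *pb)
{
  const index_entry *a = pa;
  const index_entry *b = pb;
  size_t min = a->key_len < b->key_len ? a->key_len : b->key_len;
  int res;

  assert (a->text && b->text);
  res = memcmp (a->key, b->key, min);
  if (res != 0)
    return res;
  if (a->key_len != b->key_len)
    return a->key_len < b->key_len ? -1 : 1;
  /* Keys are identical: compare the original bytes. */
  return strcmp (a->text, b->text);
}
```

Used with qsort (entries, n, sizeof (index_entry), compare_entries), two
entries with the same key end up in the same relative order regardless of
their order in the input.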
