Hi Grisha,

On Mon, Sep 08, 2025 at 02:24:50AM -0400, Grisha Levit wrote:
> On Sun, Sep 7, 2025 at 2:46 AM Duncan Roe wrote:
> > `ls -1 [0-5]*` should produce the same output as `ls -1` but instead:-
> [...]
> > superscripts ¹, ² & ³ are missing.
> >
> > My take at an explanation: '₀' - '₉' are Unicode U+2080-9. These display
> > fine. '⁰' is U+2070 & '⁹' is U+2079, but '¹' is U+00B9, '²' is U+00B2 &
> > '³' is U+00B3.
>
> This appears to be a bug with the globasciiranges option.
>
> The documentation suggests that enabling this option will disable locale-
> aware collation in range expressions:
>
>       globasciiranges
>           If set, range expressions used in pattern matching  bracket
>           expressions  (see  Pattern  Matching above) behave as if in
>           the traditional C locale when performing comparisons.  That
>           is, pattern matching does not  take  the  current  locale’s
>           collating sequence  into  account,  so  b  will not collate
>           between  A  and  B,  and  upper‐case  and  lower‐case ASCII
>           characters will collate together.
>
> But the implementing code [1] for multibyte locales does the following:
>
>    385  charcmp_wc (wint_t c1, wint_t c2, int forcecoll)
>    ...
>    393    if (forcecoll == 0 && glob_asciirange && c1 <= UCHAR_MAX && c2 <= UCHAR_MAX)
>    394      return ((int)(c1 - c2));
>    ...
>    399    return (wcscoll (s1, s2));
>
> So, in fact, locale-aware collation is disabled only if the range start
> and end codepoints are both in the range U+0001..U+00FF.  This doesn't
> make much sense for codepoints in the range U+0080..U+00FF.
>
> We should either:
>
>   * Remove the <= UCHAR_MAX checks (which would make the behavior match
>     the documentation)
>   * Replace the <= UCHAR_MAX checks with <= 0x7f checks (and update the
>     documentation to note that C locale-style comparisons are done only
>     if both ends of the range are ASCII characters)
>
> [1] https://cgit.git.savannah.gnu.org/cgit/bash.git/tree/lib/glob/smatch.c?h=bash-5.3#n385
>

As I just responded to Oğuz, `ls -1 [i-j]*` shows ⁱ.txt with globasciiranges on,
and 'i' and 'j' are certainly in the range U+0001..U+00FF.
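
To make the check quoted above easier to poke at, here is a minimal
standalone sketch in C. It is not bash code. The names uses_codepoint_path
and in_codepoint_range, and the `limit` parameter, are hypothetical and
exist only for illustration. It also assumes (my reading, not something
confirmed in this thread) that the matcher passes the character under test
and one range endpoint as c1 and c2, rather than the two endpoints.

```c
#include <assert.h>
#include <limits.h>
#include <wchar.h>

/* Hypothetical helper, not part of bash.  Returns 1 if comparing c1 and c2
   would take the plain code point path of the quoted check, 0 if it would
   fall through to locale collation (wcscoll).  `limit` stands in for the
   bound in that check: UCHAR_MAX as quoted, or 0x7f under the second
   proposal.  forcecoll == 0 and globasciiranges enabled are assumed. */
int
uses_codepoint_path (wint_t c1, wint_t c2, wint_t limit)
{
  assert (limit <= UCHAR_MAX);  /* only the two proposed bounds make sense */
  return c1 <= limit && c2 <= limit;
}

/* Hypothetical range test in pure code point order, i.e. the result when
   both endpoint comparisons take the code point path. */
int
in_codepoint_range (wint_t test, wint_t start, wint_t end)
{
  return test >= start && test <= end;
}
```

Under those assumptions, uses_codepoint_path (0x00B9, L'0', UCHAR_MAX) is 1,
so '¹' is compared by code point and in_codepoint_range (0x00B9, L'0', L'5')
is 0. For '⁰' (U+2070) and 'ⁱ' (U+2071) the helper returns 0, so collation
decides, which would fit both the original report and the [i-j] result. With
a limit of 0x7f, '¹' would go through collation as well.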

Cheers ... Duncan.
