bug#11621: questionable locale sorting order (especially as related to char ranges in REs)

Pádraig Brady Mon, 04 Jun 2012 01:49:27 -0700

On 06/04/2012 06:03 AM, Linda A. Walsh wrote:
> 
> 
> Pádraig Brady wrote:
>> On 06/03/2012 11:13 PM, Linda Walsh wrote:
>>> Within the past few years, use of ranges in RE's has become
>>> unreliable due to some locale changes sorting their native character
>>> sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
>>>
>>> There seems to be a problem when a user has set their system to use
>>> Unicode: it no longer uses the locale-specific character set
>>> (iso-8859-x, or others).
> ----
>     To clarify my above statement:
> 
> 
>    There seems to be a problem when a user has set their system to use
> Unicode: it no longer uses the locale-specific character set (iso-8859-x,
> or others) -- ***or*** *their* *orderings*.  I.e. Unicode defines a collation
> order -- I don't know that the others do ('C' does, but I don't know about
> other locale-specific character sets).
> 
> 
>> It's not specific to "unicode". Sorting in an iso-8859-1 charset
>> results in locale ordering:
> ----
>     Can you cite a source specifying the sort/collation order of the
> iso-8859-1 charset that would prove that it is non-conforming to the collation
> specification for that charset?
> 
>     I.e. if there is no official source, then the order with that charset
> is "undefined", and while it may not be desirable, returning a<A<b<B would
> not be "an error".


It's a charset. Of course the order is defined. Try: man iso-8859-1

The relative ordering can be trivially inferred from the command I presented.
But to be explicit:

$ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=en_US sort | iconv -f iso-8859-1
a
A
á
b

$ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=C sort | iconv -f iso-8859-1
A
a
b
á

> 
> 
> 
> 
>>> http://unicode.org/charts/case/chart_Latin.htm.
>>
>> http://unicode.org/charts/case/chart_Latin.html
> ---
>     ^^Correct^^ (typo)
> 
>>> Temporarily ignoring accents, only talking about lower and upper
>>> case letters, ...
>>
>> Well case comparison is a complicated area.
> ----
>     A bit, but it's mostly just wrong in the gnu library concerning unicode,
> and, as you are pointing out -- the 'C' encoding as well.
> The 'C' locale was the original charset used by the 'C' language -- only 8
> bits wide.
> 
>     So how can it sort characters beyond the lower 256?
> This would seem to be meaningless and bogus output.

http://www.pixelbeat.org/docs/utf8_programming.html

> Is it?...   When the case comparison ordering is specified in a
> standard, it makes it fairly clear that one is either compliant with the
> standard or not.
> 
>     In this case, the Gnu sort/collation lib is not Unicode/UTF-8 compliant.
> 
>     What happens in other charsets may or may not be covered under some
> other standard -- e.g. the 'C'/ascii ordering is specified.  But I don't know
> if others have relevant standards or not.
> 
>>
>> For the special case of discounting accented chars etc.
>> you can use an attribute of the well designed UTF-8.
> ---
>     This is not exactly the point -- the point is that the core sort
> DOESN'T use that ordering.  That's the bug I am reporting.

Well you can't generally exclude accents.

> 
>     In reporting this, I'm trying to keep the argument 'simple' and focus on
> the problem of widely used ranges in the first 256 code-points of
> Unicode.
> 
>     Unicode gives a fairly extensive algorithm for handling accents,
> but I didn't want to complicate the discussion by "going there".  Please
> focus this bug on the lower 128 code points, as full unicode compliance
> with the full collation algorithm that is specified is likely to be a
> larger task.  HOWEVER, fixing the sorting/collation order of the lower
> 128 code points is, comparatively, a small task that conceivably could be
> fixed in the next release.

The lower 128 code points (0-127) are ASCII. If your input data is ASCII, just use LC_ALL=C.
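
To illustrate, here is a minimal sketch, assuming purely ASCII input. The function names `lower_only` and `c_sorted` are made up for this example. Under LC_ALL=C a bracket range such as [a-z] is interpreted by byte value, so it matches lowercase ASCII letters only, and sort orders all uppercase letters before all lowercase ones:

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# In the C locale, [a-z] covers bytes 0x61..0x7a, i.e. lowercase ASCII only.
lower_only() {
    printf '%s\n' a A b B y Y z Z | LC_ALL=C grep '^[a-z]$'
}

# In the C locale, sort compares bytes, so A-Z (0x41..0x5a) precede a-z.
c_sorted() {
    printf '%s\n' b A z a B Z | LC_ALL=C sort
}

lower_only
c_sorted
```

The first function prints a, b, y, z (one per line); the second prints A, B, Z, a, b, z.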

>> Enabling traditional byte comparison on (normalized) UTF-8 data
>> will result in data sorted in Unicode code point order:
>> A b a á => A a b á
> 
> But you are missing the point (as well as raising an interesting
> 'feature'(?bug?)).
> 
> How is it that 'C' collation collates characters that are outside the ascii
> range?

Well, whether C should be a "unicode" or "ascii" charset is a whole different
kettle of fish. I was just pointing out (as per the link above) that
UTF-8 is well designed, so that it works with many traditional single-byte
functions.
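
As a rough sketch of that property: sorting UTF-8 data bytewise gives the same order as sorting by code point. The helper name `utf8_c_sort_hex` is invented for this example, and the input bytes are written as octal escapes so the snippet does not depend on the encoding of the file it is pasted into.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# Input lines (UTF-8 bytes given as octal escapes):
#   U+2167 = e2 85 a7, U+00E1 = c3 a1, U+0062 = 62, U+0041 = 41
utf8_c_sort_hex() {
    printf '\342\205\247\n\303\241\nb\nA\n' |
        LC_ALL=C sort |
        while IFS= read -r line; do
            # Dump the bytes of each sorted line as contiguous hex.
            printf '%s' "$line" | od -An -tx1 | tr -d ' \n'
            printf '\n'
        done
}

utf8_c_sort_hex
```

This prints 41, 62, c3a1, e285a7, which is also the code point order U+0041 < U+0062 < U+00E1 < U+2167. No decoding is needed because a larger code point always encodes to a bytewise larger UTF-8 sequence.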

> I.e. -- you can't interpret input data as 'unicode' in the 'C' locale.
> So how does this work in the 'C' locale?  AND more importantly -- it SHOULD
> work when charset is unicode (UTF-8)... and does not.  Test prog:
> ---------------
> #!/bin/bash
> set -m
> # vals to test:
> declare -a vals=( A a B b X x Y y Z z Ⅷ  Ⅴ Ⅲ Ⅰ Ⅿ Ⅽ ⅶ  ⅼ ⅲ )
> COLLATE_ORDER=C
> 
> function isatty {
>     local fd=${1:-1} ;
>     0<&$fd tty -s
> }
> 
> function ord {
>   local nl="";
>     isatty && nl="\n"
>     printf "%d$nl" "'$1"
> }
> 
> function background_print {
>     readarray -t inp
>     for ch in "${inp[@]}"; {
>         printf "%s   (U+%x)\n" "$ch" "$(ord "$ch")"
>     }
> }
> 
> 
> printf "%s\n" "${vals[@]}" |
>         LC_COLLATE=$COLLATE_ORDER sort |
>         background_print
> 
> ------------------------------------
> 
> Note, that the above produces:
> 
> /tmp/stest
> Ⅷ   (U+2167)
> Ⅴ   (U+2164)
> Ⅲ   (U+2162)
> Ⅰ   (U+2160)
> Ⅿ   (U+216f)
> Ⅽ   (U+216d)
> ⅶ   (U+2176)
> ⅼ   (U+217c)
> ⅲ   (U+2172)
> a   (U+61)
> A   (U+41)
> b   (U+62)
> B   (U+42)
> x   (U+78)
> X   (U+58)
> y   (U+79)
> Y   (U+59)
> z   (U+7a)
> Z   (U+5a)
> 
> NOT the output you showed...Seems there's a bug in the C collation order?

Note that C doesn't use a collation order; it's simple byte comparison.
Seems there may be a bug in your script?
Also ensure that LC_ALL is not set, since it overrides LC_COLLATE.

$ printf "%s\n" A a B b 2 1 Ⅷ  ⅶ ⅲ | LC_COLLATE=C sort
1
2
A
B
a
b
Ⅷ
ⅲ
ⅶ
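
A small sketch of the LC_ALL precedence mentioned above. The function names are made up for this example, and en_US.UTF-8 is only a placeholder value that need not be installed, since it is ignored in the first case:

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# LC_ALL overrides LC_COLLATE, so the LC_COLLATE value is ignored here
# and sort falls back to plain byte comparison.
lc_all_wins() {
    printf '%s\n' b A a B | LC_ALL=C LC_COLLATE=en_US.UTF-8 sort
}

# To make LC_COLLATE effective, clear LC_ALL first (in a subshell, so the
# caller's environment is left untouched).
collate_only() {
    printf '%s\n' b A a B | ( unset LC_ALL; LC_COLLATE=C sort )
}

lc_all_wins
collate_only
```

Both functions print A, B, a, b. If LC_ALL were set to a non-C locale in your environment, setting LC_COLLATE=C alone would have no effect, which may be what is happening in the script.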

> 
> Changing collation order to UTF-8:
> 
> Same thing:
>  /tmp/stest
> Ⅷ   (U+2167)
> Ⅴ   (U+2164)
> Ⅲ   (U+2162)
> Ⅰ   (U+2160)
> Ⅿ   (U+216f)
> Ⅽ   (U+216d)
> ⅶ   (U+2176)
> ⅼ   (U+217c)
> ⅲ   (U+2172)
> a   (U+61)
> A   (U+41)
> b   (U+62)
> B   (U+42)
> x   (U+78)
> X   (U+58)
> y   (U+79)
> Y   (U+59)
> z   (U+7a)
> Z   (U+5a)
> 
> 
>>> I would assert this is a serious bug that should be addressed ASAP...
>>
>> As for the question in the subject for handling ranges in REs,
>> there has been recent work in changing as you suggest:
>>
>> http://lists.gnu.org/archive/html/bug-gnulib/2011-06/threads.html#00105
> ----
> 
>     Recent?

?

> The most recent posts on that thread look to be from June of last year.
> I.e. a year ago.
> 
> I'm trying to stay focused on specific problems -- UTF-8 ordering is defined.
> The gnu library doesn't follow it.
> 
> Major problem with so many progs relying on the lib!...

cheers,
Pádraig.
