bug#11621: questionable locale sorting order (especially as related to char ranges in REs)

Pádraig Brady Sun, 03 Jun 2012 15:58:18 -0700

On 06/03/2012 11:13 PM, Linda Walsh wrote:
> Within the past few years, use of ranges in RE's has become
> unreliable due to some locale changes sorting their native character
> sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
> 
> Additionally many distro's have switched to UTF-8 resulting in
> localizations like en_GB.UTF-8, en_US.UTF-8, etc...
> 
> There seems to be a problem in when a user has set their system to use
> Unicode, it is no longer using the locale specific character set (iso-8859-x,
> or others).


It's not specific to "unicode". Sorting in an iso-8859-1 charset
also results in locale ordering:

$ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=en_US sort | iconv -f iso-8859-1
a
A
á
b

> In Unicode, it is recommended that upper case be uniformly sorted
> below lower case (section 6.6, http://www.unicode.org/reports/tr10/).
> 
> A chart, including accent variations is at
> 
> http://unicode.org/charts/case/chart_Latin.htm.

http://unicode.org/charts/case/chart_Latin.html

> Temporarily ignoring accents, only talking about lower and upper
> case letters, you will note that the sorting order of A=41, B=42, C=43,
> while the lower case letters from 'a', have weights a=61, b=62, c=63.
> 
> This uniformly puts all lower case letters "after" any upper case letters.
> 
> Thus -- I am asserting, that any computer using a locale for country
> preferences, BUT is also using a unicode character set (e.g. UTF-8),
> should return sorted results as specified by the character set.
> 
> I.e. the utility 'sort' (and any programs that use the collation/sorting
> order specified in the core-utils libs) should return A-Z < a-z.

Well, case comparison is a complicated area.

For the special case of discounting accented chars etc.
you can use an attribute of the well-designed UTF-8 encoding:
its byte sequences compare in the same order as the code points they encode.
So enabling traditional byte comparison on (normalized) UTF-8 data
will result in data sorted in Unicode code point order:

$ printf "%s\n" A b a á | LC_ALL=C sort
A
a
b
á
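
As a rough sketch of the above (the helper name sort_by_bytes is made up
for illustration), forcing the C locale for the sort command alone gives
byte comparison, and so code point order for UTF-8 input, whatever locale
the rest of the session uses. The input is written with octal escapes so
that the bytes are explicit:

```shell
#!/bin/sh
# sort_by_bytes: hypothetical helper, not part of coreutils.
# LC_ALL=C makes sort(1) compare raw bytes; for UTF-8 input
# that is the same as Unicode code point order.
sort_by_bytes() {
    LC_ALL=C sort
}

# 'á' is U+00E1, encoded in UTF-8 as the bytes 0xC3 0xA1
# (octal 303 241), so it sorts after every ASCII character.
printf 'A\nb\na\n\303\241\n' | sort_by_bytes
```

This prints the same four lines as the transcript above: A, a, b, á.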

> This is currently not the case and is leading to erroneous results
> in programs written before locales were considered.  The thing is --
> in many cases, within some short period of locales being implemented,
> many or most distro's also switched to UTF-8.
> 
> Unfortunately its collation order has not been respected.
> 
> I would assert this is a serious bug that should be addressed ASAP...

As for the question in the subject about handling ranges in REs,
there has been recent work on changing that as you suggest:

http://lists.gnu.org/archive/html/bug-gnulib/2011-06/threads.html#00105
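
For the range question specifically, here is a small sketch (the helper
name is hypothetical, and this is not taken from the linked thread). In the
C locale a range such as [A-Z] covers only the bytes 0x41..0x5A, so it
cannot pick up lower case letters. What the same range matches in other
locales depends on the regex implementation and its version, so nothing is
claimed about those here.

```shell
#!/bin/sh
# only_upper_ascii: hypothetical helper for illustration.
# With LC_ALL=C the bracket range [A-Z] is evaluated on byte values,
# so it matches exactly the 26 upper case ASCII letters.
only_upper_ascii() {
    LC_ALL=C grep '[A-Z]'
}

# Only the lines containing an upper case ASCII letter pass through.
printf '%s\n' a B y Z | only_upper_ascii
```

This prints B and Z. When the intent is "upper case in the current locale"
rather than a byte range, the POSIX character class [[:upper:]] expresses
that directly and does not depend on collation order.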

cheers,
Pádraig.
