bug#11621: questionable locale sorting order (especially as related to char ranges in REs)

Linda Walsh Sun, 03 Jun 2012 15:14:20 -0700

Within in the past few years, use of ranges in RE's has become
unreliable due to some locale changes sorting their native character
sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).


Additionally many distro's have switched to UTF-8 resulting in
localizations like en_GB.UTF-8, en_US.UTF-8, etc...

There seems to be a problem in when a user has set their system to use

Unicode, it is no longer using the locale specific character set(iso-8859-x,

or others).

In Unicode, it is recommended that upper case be uniformly sorted
below lower case (section 6.6, http://www.unicode.org/reports/tr10/).

A chart, including accent variations is at

http://unicode.org/charts/case/chart_Latin.htm.

Temporarily ignoring accents, only talking about lower and upper
case letters, you will note that the sorting order of A=41, B=42, C=43,
while the lower case letters from 'a', have weights a=61, b=62, c=63.

This uniformly puts all lower case letters "after" any upper case letters.

Thus -- I am asserting, that any computer using a local for country
preferences, BUT is also using a unicode character set (e.g. UTF-8),
should return sorted results as specified by the character set.

I.e. the utility 'sort' (and any programs that use the collation/sorting
order specified in the core-utils libs) should return A-Z < a-z.


This is currently not the case and is leading to erroneous results
in programs written before locales were considered.  The thing is --
in many cases, within some short period of locales being implemented,
many or most distro's also switched to UTF-8.

Unfortunately it's collation order has not been respected.

I would assert this is a serious bug that should be addressed ASAP...


Thanks,
Linda W.

bug#11621: questionable locale sorting order (especially as related to char ranges in REs)

Reply via email to