Bug#884943: yeah, non-C collation is bogus

Adam Borowski Thu, 21 Dec 2017 12:57:31 -0800

The real culprit is glibc rather than coreutils, as it's the former which
defines collation rules.


> If not `LC_ALL=C` is set, the sort result is weird.

> Running it with `LC_ALL=C` everything is fine:
> > sort: using simple byte comparison
> > a-ab
> > ____
> > a-ac
> > ____
> > aac
> > ___
>
>
> But switching to en_US.UTF-8 or de_DE.UTF-8 I got the broken result:
> > sort: using 
> > a-ab
> > ____
> > aac
> > ___
> > a-ac
> > ____

Yeah, this is indeed bogus.  Years ago, I compared different libcs and
operating systems, with a sample that looks mostly at version numbers:

glibc, "real" language locales:

0 9
0.9.0
0.9.0-a0-foo-bar
({---=[ 0.9.0-a11 ]=---})
0.9.0-a17-quux
(0.9.0-a2)
0.9.0+a99-1
0.9.0-rc1
0.9.1
0 9 9
({---=[ 0.9-a11 ]=---})
0.9 ab

Windows:

(0.9.0-a2)
({---=[ 0.9.0-a11 ]=---})
({---=[ 0.9-a11 ]=---})
0 9
0 9 9
0.9 ab
0.9.0
0.9.0+a99-1
0.9.0-a0-foo-bar
0.9.0-a17-quux
0.9.0-rc1
0.9.1

glibc, C.UTF-8:

(0.9.0-a2)
({---=[ 0.9-a11 ]=---})
({---=[ 0.9.0-a11 ]=---})
0 9
0 9 9
0.9 ab
0.9.0
0.9.0+a99-1
0.9.0-a0-foo-bar
0.9.0-a17-quux
0.9.0-rc1
0.9.1


The ordering between . and - is debatable, but sorting without any heed to
symbols is an obvious mistake.  My guess is that someone took rules written
for 19th-century printed dictionaries, where symbols were a rare oddity, and
applied them to a computer setting where symbols do matter.
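
To make the difference concrete, here is a minimal Python sketch of the two
behaviours on the sample above.  Plain string comparison in Python is by code
point, which for this ASCII-only sample matches the C.UTF-8 listing.  The
helper `ignore_symbols` is hypothetical and is not glibc's actual algorithm
(which is a multi-level comparison); it only drops everything except letters
and digits, which happens to reproduce the "real" language locale listing for
these twelve strings.

```python
samples = [
    "0.9.1", "0 9", "(0.9.0-a2)", "0.9.0-rc1", "0.9 ab", "0.9.0",
    "({---=[ 0.9-a11 ]=---})", "0.9.0+a99-1", "0 9 9",
    "0.9.0-a17-quux", "({---=[ 0.9.0-a11 ]=---})", "0.9.0-a0-foo-bar",
]


def ignore_symbols(s):
    # Hypothetical helper: keep only letters and digits, so that
    # punctuation and spaces play no part in the comparison.
    # This is an approximation for illustration, not glibc's real rules.
    return "".join(ch for ch in s if ch.isalnum())


# Code point order: every symbol takes part in the comparison.
codepoint_order = sorted(samples)

# Symbol-blind order: "0.9.0-rc1" is compared as "090rc1", and so on.
symbol_blind_order = sorted(samples, key=ignore_symbols)

for line in codepoint_order:
    print(line)
print()
for line in symbol_blind_order:
    print(line)
```

The same effect explains the report: by code point, "a-ac" sorts before "aac"
because "-" precedes "a", while a symbol-blind comparison sees "aac" twice.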

I agree with you that "a-a" must not sort the same as "aa" -- this surprises
users and makes it hard to eyeball-search, which is the primary purpose of
human-locale sort.

> So any algorithm or script which depends on a stable sorted order will
> fail.

Well yeah, this should not differ between implementations.  Different
languages do have different collation orders, though: most languages place
accented characters just after the base letter (or sometimes sort them as
the same), but for example Swedish wants z<å.  Thus, you must not assume the
order is stable between locales.
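
A toy Python sketch of that difference, using no real locale data: the
Swedish alphabet is written out by hand, and the "base letter" rule is
approximated by stripping combining marks after Unicode decomposition.  The
names `swedish_key` and `base_letter_key` are made up for this illustration.

```python
import unicodedata

# Swedish alphabet order, written out by hand for this sketch:
# the three extra letters come after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word):
    # Rank each letter by its position in the Swedish alphabet;
    # anything outside the alphabet sorts last (an arbitrary choice here).
    return [SWEDISH_RANK.get(ch, len(SWEDISH_RANK)) for ch in word]


def base_letter_key(word):
    # Approximate "accented letters sort as the base letter":
    # decompose, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


words = ["zon", "\u00e5ka", "bok"]

swedish_order = sorted(words, key=swedish_key)
base_letter_order = sorted(words, key=base_letter_key)

print(swedish_order)
print(base_letter_order)
```

The same three words come out in two different orders, which is why a script
must not rely on one order holding across locales.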

> As a mitigation I tried `LC_COLLATE=C`  but still the ‘en_US.UTF-8’
> sorting rules will be used :-/

For this reason, I use LC_COLLATE=C.UTF-8, which works for me (do you
perhaps have LC_ALL set, which overrides LC_COLLATE?).  There's the issue of
case-sensitive sorting, which is liked by hackers but not by normal people,
but otherwise, for an English-language user, C.UTF-8 collation is
drastically better.  An international user would want a<ą<b, though, which
makes C.UTF-8 inadequate here.
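
As a sketch of the precedence in question: the locale used for collation is
taken from LC_ALL if that is set and non-empty, otherwise from LC_COLLATE,
otherwise from LANG.  The function `effective_collate` below is a
hypothetical helper that mimics this lookup on a plain dictionary; the
fallback to "C" when nothing is set is an assumption (POSIX leaves that
default to the implementation).  The last lines show why code point order
cannot give a<ą<b.

```python
def effective_collate(env):
    # Hypothetical helper mimicking the lookup order for collation:
    # LC_ALL wins over LC_COLLATE, which wins over LANG.
    # Empty values count as unset.
    for name in ("LC_ALL", "LC_COLLATE", "LANG"):
        value = env.get(name)
        if value:
            return value
    # Assumed default when nothing is set.
    return "C"


# LC_COLLATE=C has no effect while LC_ALL is still set.
overridden = effective_collate(
    {"LANG": "en_US.UTF-8", "LC_ALL": "en_US.UTF-8", "LC_COLLATE": "C"}
)

# Without LC_ALL, LC_COLLATE takes effect.
honoured = effective_collate({"LANG": "en_US.UTF-8", "LC_COLLATE": "C.UTF-8"})

print(overridden)
print(honoured)

# Code point order puts U+0105 (a with ogonek) after every ASCII letter,
# so it lands after "b" rather than between "a" and "b".
code_point_order = sorted(["b", "\u0105", "a"])
print(code_point_order)
```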


Meow!
-- 
// If you believe in so-called "intellectual property", please immediately
// cease using counterfeit alphabets.  Instead, contact the nearest temple
// of Amon, whose priests will provide you with scribal services for all
// your writing needs, for Reasonable And Non-Discriminatory prices.
