The real culprit is glibc rather than coreutils, as it's the former which defines collation rules.
> If not `LC_ALL=C` is set, the sort result is weird.
> Running it with `LC_ALL=C` everything is fine:
> > sort: using simple byte comparison
> > a-ab
> > ____
> > a-ac
> > ____
> > aac
> > ___
> >
> But switching to en_US.UTF-8 or de_DE.UTF-8 I got the broken result:
> > sort: using
> > a-ab
> > ____
> > aac
> > ___
> > a-ac
> > ____

Yeah, this is indeed bogus. Years ago, I compared different libcs and operating systems, with a sample that looks mostly at version numbers:

glibc, "real" language locales:
0 9
0.9.0
0.9.0-a0-foo-bar
({---=[ 0.9.0-a11 ]=---})
0.9.0-a17-quux
(0.9.0-a2)
0.9.0+a99-1
0.9.0-rc1
0.9.1
0 9 9
({---=[ 0.9-a11 ]=---})
0.9 ab

Windows:
(0.9.0-a2)
({---=[ 0.9.0-a11 ]=---})
({---=[ 0.9-a11 ]=---})
0 9
0 9 9
0.9 ab
0.9.0
0.9.0+a99-1
0.9.0-a0-foo-bar
0.9.0-a17-quux
0.9.0-rc1
0.9.1

glibc, C.UTF-8:
(0.9.0-a2)
({---=[ 0.9-a11 ]=---})
({---=[ 0.9.0-a11 ]=---})
0 9
0 9 9
0.9 ab
0.9.0
0.9.0+a99-1
0.9.0-a0-foo-bar
0.9.0-a17-quux
0.9.0-rc1
0.9.1

Ordering between . and - is debatable, but sorting without any heed to symbols is an obvious mistake. My guess is that someone took rules written for 19th-century printed dictionaries, where symbols were a rare oddity, and applied them to a computer setting where symbols do matter.

I agree with you that "a-a" must not sort the same as "aa" -- this surprises users and makes it hard to eyeball-search, which is the primary purpose of a human-locale sort.

> So any algorithm or script which depends on a stable sorted order will
> fail.

Well yeah, this should not differ between implementations. Different languages do have different collation orders, though: most languages place accented characters just after the base letter (or sometimes sort them as the same), but for example Swedish wants z<å. Thus, you must not assume the order is stable between locales.
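To make the "no heed to symbols" point concrete, here is a rough Python sketch that reproduces both orderings of the version-number sample. It is only an approximation: `symbol_blind_key` is a made-up helper that simply deletes everything except ASCII letters and digits, whereas glibc's real collation is multi-level and only ignores symbols on the first pass. It needs no installed locales, since it does not call into libc at all.

```python
import re

# The version-number sample from above, in arbitrary input order.
SAMPLE = [
    "0.9.0", "0.9.1", "0 9", "0 9 9", "0.9 ab",
    "(0.9.0-a2)", "0.9.0-rc1", "0.9.0+a99-1",
    "0.9.0-a0-foo-bar", "0.9.0-a17-quux",
    "({---=[ 0.9.0-a11 ]=---})", "({---=[ 0.9-a11 ]=---})",
]


def symbol_blind_key(s):
    """Hypothetical stand-in for a symbol-ignoring first collation pass:
    keep only ASCII letters and digits, drop spaces and punctuation."""
    return re.sub(r"[^0-9A-Za-z]", "", s)


# Plain code point comparison; for this ASCII-only sample it is the same
# as the byte comparison used by the C locale.
byte_order = sorted(SAMPLE)

# Comparison on the stripped keys only; all keys in this sample are
# distinct, so no tie-breaking rule is needed here.
symbol_blind_order = sorted(SAMPLE, key=symbol_blind_key)

if __name__ == "__main__":
    print("byte order:")
    for line in byte_order:
        print("  " + line)
    print("symbol-blind order:")
    for line in symbol_blind_order:
        print("  " + line)
```

With this sample, `byte_order` comes out like the C.UTF-8 list and `symbol_blind_order` like the "real" language locale list. Note that "a-ac" and "aac" get the same stripped key, which is exactly the "a-a" versus "aa" problem.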
> As a mitigation I tried `LC_COLLATE=C` but still the ‘en_US.UTF-8’
> sorting rules will be used :-/

For this reason, I use LC_COLLATE=C.UTF-8, which works for me (do you perhaps have LC_ALL, which overrides LC_COLLATE?).

There's the issue of case-sensitive sorting which is liked by hackers but not by normal people, but otherwise, for an English language user, C.UTF-8 collation is drastically better. An international user would want a<ą<b, though, which makes C.UTF-8 inadequate here.


Meow!
-- 
// If you believe in so-called "intellectual property", please immediately
// cease using counterfeit alphabets. Instead, contact the nearest temple
// of Amon, whose priests will provide you with scribal services for all
// your writing needs, for Reasonable And Non-Discriminatory prices.