Odd differences in locale sorting

David Starner Thu, 02 Aug 2001 22:37:56 -0700
for i in `locale -a | grep utf8`; do echo -n $i": "; LC_COLLATE=$i dir; done \
| rev | sort | rev
uk_UA.utf8: ð  a  d  s  ss ß  t  th  þ  ti  z
tr_TR.utf8: ð  a  d  s  ss ß  t  th  þ  ti  z
lt_LT.utf8: ð  a  d  s  ss ß  t  th  þ  ti  z
ru_RU.utf8: ð  a  d  s  ss ß  t  th  þ  ti  z
el_GR.utf8: a  d  ð  s  ss ß  t  th  þ  ti  z
ja_JP.utf8: a  d  s  ss  t th  ti z  ð  ß  þ
ko_KR.utf8: a  d  s  ss  t th  ti z  ð  ß  þ
de_DE.utf8: a  d  ð  s  ss ß  t  th  ti  z  þ
ar_EG.utf8: a  d  ð  s  ss ß  t  th  ti  z  þ
he_IL.utf8: a  d  ð  s  ss ß  t  th  ti  z  þ
zh_CN.utf8: a  d  ð  s  ss ß  t  th  ti  z  þ
hi_IN.utf8: a  d  ð  s  ss ß  t  th  ti  z  þ
vi_VN.utf8: a  d  ð  s  ss ß  t  th  ti  z  þ
eo_EO.utf8: a  d  ð  s  ss ß  t  th  ti  z  þ
fa_IR.utf8: a  d  ð  s  ss ß  t  th  ti  z  þ
is_IS.utf8: a  d  ð  s  ss ß  t  th  ti  z  þ
en_US.utf8: a  d  ð  s  ss ß  t  th  ti  z  þ
mt_MT.utf8: a  d  ð  s  ss ß  t  th  ti  z  þ


In sorting a, d, s, ss, t, th, ti, z, eth, thorn and eszett (sharp s), I
would expect that if Icelandic and German (the two major languages to
use those characters) matched, the rest of the world would sort them
in the same order. They do match: eth sorts with d, eszett sorts with ss,
and thorn sorts after z. Since that's true, I'm surprised to find there
are four major sort orders for them. ja and ko _seem_ to be doing a
straight binary sort (ja sorts à (a`) and ĉ (c^) after the rest of the
alphabet; ko sorts them before!). uk/tr/lt/ru sort thorn as th and put
eth out front (the first explicable but wrong; the second inexplicable
to me). el gets it mostly correct, but sorts thorn with th.
de/ar/he/zh/hi/vi/eo/fa/is/en/mt all agree.
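
For a baseline to compare against, here is a minimal sketch that sorts
the same eleven names under one named locale. It assumes a POSIX shell,
a script saved as UTF-8, and the standard sort and paste utilities;
sort_names is a hypothetical helper, not something from the listing
above. It sets LC_ALL rather than LC_COLLATE because LC_ALL, if already
set in the environment, overrides LC_COLLATE.

```shell
#!/bin/sh
# sort_names LOCALE: print the eleven test names on one line, sorted
# under LOCALE. Hypothetical helper for illustration only. Any locale
# other than C or POSIX must be installed (check with `locale -a`).
sort_names() {
    printf '%s\n' a d s ss t th ti z ð ß þ \
        | LC_ALL="$1" sort \
        | paste -s -d ' ' -
}

# The C locale compares raw bytes, so this is the pure binary order.
sort_names C
# → a d s ss t th ti z ß ð þ
```

In byte order ß comes before ð, whereas the ja and ko lines above show
ð first, so those two locales may not be doing a strictly binary
comparison after all. Running sort_names with another locale name, for
example de_DE.utf8, gives that locale's order for comparison.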

Is there any reason for the differences? Can they - should they - be
changed? It seems that at least all the non-Latin-script languages
should sort Latin script the same way, or at least choose between a
standard, language-neutral 'correct' sort and an efficient sort.


--
David Starner - [EMAIL PROTECTED]
"The pig -- belongs -- to _all_ mankind!" - Invader Zim

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/