for i in $(locale -a | grep utf8); do echo -n "$i: "; LC_COLLATE=$i dir; done |
    rev | sort | rev
uk_UA.utf8: ð a d s ss ß t th þ ti z
tr_TR.utf8: ð a d s ss ß t th þ ti z
lt_LT.utf8: ð a d s ss ß t th þ ti z
ru_RU.utf8: ð a d s ss ß t th þ ti z
el_GR.utf8: a d ð s ss ß t th þ ti z
ja_JP.utf8: a d s ss t th ti z ð ß þ
ko_KR.utf8: a d s ss t th ti z ð ß þ
de_DE.utf8: a d ð s ss ß t th ti z þ
ar_EG.utf8: a d ð s ss ß t th ti z þ
he_IL.utf8: a d ð s ss ß t th ti z þ
zh_CN.utf8: a d ð s ss ß t th ti z þ
hi_IN.utf8: a d ð s ss ß t th ti z þ
vi_VN.utf8: a d ð s ss ß t th ti z þ
eo_EO.utf8: a d ð s ss ß t th ti z þ
fa_IR.utf8: a d ð s ss ß t th ti z þ
is_IS.utf8: a d ð s ss ß t th ti z þ
en_US.utf8: a d ð s ss ß t th ti z þ
mt_MT.utf8: a d ð s ss ß t th ti z þ
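
The listing above can also be grouped explicitly by ordering, so that each distinct sort order shows up once with the locales that produce it. This is only a minimal sketch, assuming the input is in the same "locale: ordering" form as the lines above; group_by_order is a hypothetical helper name, not part of the original command.

```shell
#!/bin/sh
# group_by_order (hypothetical helper): reads "locale: ordering" lines on
# stdin and prints each distinct ordering once, followed by the locales
# that produced it.
group_by_order() {
    awk -F': ' '
        {
            # $1 is the locale name, $2 is the ordering it produced.
            if ($2 in names)
                names[$2] = names[$2] " " $1
            else
                names[$2] = $1
        }
        END {
            for (o in names)
                print o "  <-  " names[o]
        }
    ' | LC_ALL=C sort   # awk iteration order is unspecified, so sort the groups
}

# Usage: pipe the output of the locale loop into the helper, e.g.
#   for i in $(locale -a | grep utf8); do ...; done | group_by_order
```

With the data above this should collapse the eighteen lines into four groups, one per sort order.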
In sorting a, d, s, ss, t, th, ti, z, eth, thorn and eszett (sharp s), I
would expect that if Icelandic and German (the two major languages to
use those characters) matched, the rest of the world would sort them
in the same order. Icelandic and German do match: eth sorts with d,
eszett sorts with ss, and thorn sorts after z. Since that's true, I'm
surprised to find there are four major sort orders for them. ja and ko
_seem_ to be doing a straight binary sort (ja sorts à (a`) and ĉ (c^)
after the rest of the alphabet; ko sorts them before!). uk/tr/lt/ru sort
thorn as th and put eth out front (the first explicable but wrong; the
second inexplicable to me). el gets it mostly correct, but sorts thorn
with th. de/ar/he/zh/hi/vi/eo/fa/is/en/mt all agree.
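
One way to test the "straight binary sort" guess is to sort the same names in the C locale, which compares raw bytes, and compare the result with the ja and ko lines. This is a rough sketch, assuming the names are UTF-8 encoded; byte_order is a hypothetical helper name.

```shell
#!/bin/sh
# byte_order (hypothetical helper): sort the arguments by raw byte value
# (C locale), ignoring all language-specific collation rules, and print
# them on one line separated by spaces.
byte_order() {
    printf '%s\n' "$@" | LC_ALL=C sort | paste -s -d ' ' -
}

# In UTF-8, ß is C3 9F, ð is C3 B0 and þ is C3 BE, so all three sort
# after the ASCII letters, in that order.
byte_order ð a d s ss ß t th þ ti z
# prints: a d s ss t th ti z ß ð þ
```

Pure byte order puts ß before ð, while the ja and ko lines above show ð before ß, so those locales may be close to, but not exactly, a binary sort.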
Is there any reason for the differences? Can they - should they - be
changed?
It seems that at least all the non-Latin-script languages should sort
Latin script the same way, or at least choose between a standard,
language-neutral 'correct' sort and an efficient sort.
--
David Starner - [EMAIL PROTECTED]
"The pig -- belongs -- to _all_ mankind!" - Invader Zim
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/