Linux UTF-8 locales sort SPACE at level 4

Markus Kuhn Tue, 21 Mar 2006 11:27:23 -0800

In the file

  /usr/share/i18n/locales/iso14651_t1


in many contemporary Linux distributions (e.g., SuSE 9.3), the line

  <U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>

specifies that, with LC_COLLATE=en_GB.UTF-8 (and in many other locales),
the space character affects the sort order only at level 4, that is,
only if there are no differences in

  - base characters
  - accents
  - uppercase/lowercase

anywhere in the strings being compared.
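
To make the effect concrete, here is a rough Python sketch of a four-level
comparison in which SPACE is ignored at the first three levels. It is a toy
model, not the glibc or UCA implementation: the function name sketch_key,
the use of character positions as the level-4 weight, and the
"lowercase before uppercase" rule at level 3 are all assumptions made for
illustration only.

```python
import unicodedata

# Assumption: only SPACE is ignored at levels 1-3, as in the <U0020> line above.
IGNORED_UNTIL_LEVEL_4 = {" "}


def sketch_key(s):
    """Build a simplified four-level sort key (toy model, not real UCA)."""
    decomposed = unicodedata.normalize("NFD", s)
    visible = [c for c in decomposed if c not in IGNORED_UNTIL_LEVEL_4]
    bases = [c for c in visible if not unicodedata.combining(c)]

    level1 = tuple(c.lower() for c in bases)                         # base characters
    level2 = tuple(c for c in visible if unicodedata.combining(c))   # accents
    level3 = tuple(c.isupper() for c in bases)                       # case, lowercase first
    # Level 4: where the ignored characters sit (an arbitrary tie-break here).
    level4 = tuple(i for i, c in enumerate(decomposed)
                   if c in IGNORED_UNTIL_LEVEL_4)
    return (level1, level2, level3, level4)


words = ["death", "de luge", "deluge", "de Luge", "deLuge", "demark"]

by_levels = sorted(words, key=sketch_key)
by_codepoint = sorted(words)

print(by_levels)     # "death" comes before every "de luge" variant
print(by_codepoint)  # the space-containing variants come first
```

In this model the space never separates "de luge" from "deluge" unless
everything else is equal, so "death" ends up ahead of "de luge", which is
not where a reader scanning for the word "de" would look for it.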

Is this really what most users expect? I didn't!

The UCA has lots of options, and I think some discussion is needed
on which of these options are most appropriate for a glibc locale,
possibly leading to a revision or replacement of the iso14651_t1
file.

References:

  - Unicode Collation Algorithm (UCA), http://www.unicode.org/reports/tr10/

  - ISO TR 14652 (draft: http://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14652.pdf)

  - http://sources.redhat.com/bugzilla/show_bug.cgi?id=374

  - https://bugzilla.novell.com/show_bug.cgi?id=152778

Example:

$ cat >demo.txt
death
de luge
de-luge
deluge
de-luge
de Luge
de-Luge
deLuge
de-Luge
demark
^D

and then try

$ LC_COLLATE=C            sort demo.txt
$ LC_COLLATE=en_GB.UTF-8  sort demo.txt
$ LC_COLLATE=en_GB        sort demo.txt

and compare the results with how your dictionary or phone book sorts
these.
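
The same comparison can be made from a program. The following is a small
sketch using Python's locale module, which applies the C library's
collation through strxfrm. The helper name sort_in_locale and the word
list are mine, and I am assuming (not claiming) that the result will
generally agree with what sort prints; locales that are not installed are
reported rather than guessed at.

```python
import locale

# Sample words in the spirit of demo.txt above.
WORDS = ["death", "de luge", "de-luge", "deluge",
         "de Luge", "de-Luge", "deLuge", "demark"]


def sort_in_locale(words, name):
    """Sort words under LC_COLLATE=name; return None if the locale is missing."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None  # locale not installed on this system
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore previous setting


for name in ("C", "en_GB.UTF-8", "en_GB"):
    result = sort_in_locale(WORDS, name)
    if result is None:
        print(f"{name}: (locale not installed)")
    else:
        print(f"{name}: {result}")
```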

Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
