In the file

  /usr/share/i18n/locales/iso14651_t1

in many contemporary Linux distributions (e.g., SuSE 9.3), the line

  <U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>

defines that the space character affects the sorting order with
LC_COLLATE=en_GB.UTF-8 (and in many other locales) only at level 4,
that is, only if there are no differences in

  - base characters
  - accents
  - uppercase/lowercase

anywhere in the strings being compared.

Is this really what most users expect? I didn't!

The UCA has lots of options, and I think some discussion is needed on
which of these options are most appropriate for a glibc locale,
possibly leading to a revision or replacement of the iso14651_t1 file.

References:

  - Unicode Collation Algorithm (UCA),
    http://www.unicode.org/reports/tr10/
  - ISO TR 14652
    (draft: http://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14652.pdf)
  - http://sources.redhat.com/bugzilla/show_bug.cgi?id=374
  - https://bugzilla.novell.com/show_bug.cgi?id=152778

Example:

  $ cat >demo.txt
  death
  de luge
  de-luge
  deluge
  de-luge
  de Luge
  de-Luge
  deLuge
  de-Luge
  demark
  ^D

and then try

  $ LC_COLLATE=C sort demo.txt
  $ LC_COLLATE=en_GB.UTF-8 sort demo.txt
  $ LC_COLLATE=en_GB sort demo.txt

and see the difference with how your dictionary or phone book sorts
these.

Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
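
P.S. For readers without these locales installed, the effect of
ignoring the space at levels 1 to 3 can be imitated with a small
Python sketch. This is only a toy model, not the glibc or UCA
algorithm: the function name toy_key, the IGNORED set, and the way
level 4 is encoded (position plus character) are assumptions made for
illustration. It uses a subset of the demo words and compares the
result with plain code point order, which is what LC_COLLATE=C gives
for ASCII input.

```python
import unicodedata

# Characters treated as IGNORE at levels 1-3 in this toy model (assumption).
IGNORED = " -"

def toy_key(s):
    """Build a four-level sort key: base letters, accents, case, ignored chars."""
    decomposed = unicodedata.normalize("NFD", s)
    letters = [c for c in decomposed
               if c not in IGNORED and not unicodedata.combining(c)]
    level1 = [c.lower() for c in letters]                          # base characters
    level2 = [c for c in decomposed if unicodedata.combining(c)]   # accents
    level3 = [c.isupper() for c in letters]                        # lowercase sorts first
    level4 = [(i, c) for i, c in enumerate(s) if c in IGNORED]     # space and hyphen
    return (level1, level2, level3, level4)

words = ["death", "de luge", "de-luge", "deluge",
         "de Luge", "de-Luge", "deLuge", "demark"]

toy_sorted = sorted(words, key=toy_key)  # space matters only as a last tie-breaker
c_sorted = sorted(words)                 # code point order: space sorts before letters

print("toy four-level order:", toy_sorted)
print("code point order:    ", c_sorted)
```

In the toy order all the "deluge" variants end up next to each other
between "death" and "demark", whereas in code point order every
variant containing a space or hyphen sorts before "death".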
