On Tue, Sep 13, 2011 at 10:03:01PM +0200, Aurelien Jarno wrote:
> On Tue, Sep 13, 2011 at 05:07:46PM +0100, Colin Watson wrote:
> > On Tue, Sep 13, 2011 at 05:33:19PM +0200, Aurelien Jarno wrote:
> > > Yes similar problems have already been reported. This change has been
> > > done as a C locale should not have a collation order.
> >
> > Why not? Codepoint order collation is perfectly reasonable for a C
> > locale. Lots of people use LC_COLLATE=C when all they want is for
> > things like [a-z] to work reasonably.
> >
>
> Because it is supposed to replace the C locale, so to follow POSIX
> rules like the C locale. I am personally not convinced that we should go
> that way, but people who have pushed for this locale (some of them
> Cc:ed) have made clear in bugs #522776 and #609306 that it should handle
> collation like a C locale.
>
> Maybe they could follow-up this mail with their arguments.

OK, here goes ;-)

The "C.UTF-8" locale /is/ the "C" locale, extended to support UTF-8.
That is, it must support the *standard* behaviour mandated in the C,
POSIX and SUS standards, or else conforming applications will break.

This is the reference for the forthcoming SUSv4 locale definition:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html

This standard defines in detail exactly how various aspects of the
C/POSIX locale must behave. Conforming applications can expect this
behaviour to be guaranteed by a conforming C library. Some aspects are
strictly defined, while others offer the possibility for extension.
Examples:

LC_CTYPE

    upper
        Define characters to be classified as uppercase letters.

        In the POSIX locale, the 26 uppercase letters shall be included:

        A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

    lower
        Define characters to be classified as lowercase letters.

        In the POSIX locale, the 26 lowercase letters shall be included:

        a b c d e f g h i j k l m n o p q r s t u v w x y z

    digit
        Define the characters to be classified as numeric digits.

        In the POSIX locale, only:

        0 1 2 3 4 5 6 7 8 9

        shall be included.

    space
        Define characters to be classified as white-space characters.

        In the POSIX locale, exactly <space>, <form-feed>, <newline>,
        <carriage-return>, <tab>, and <vertical-tab> shall be included.

    cntrl
        Define characters to be classified as control characters.

        In the POSIX locale, no characters in classes alpha or print
        shall be included.

    xdigit
        Define the characters to be classified as hexadecimal digits.

        In the POSIX locale, only:

        0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f

    blank
        Define characters to be classified as <blank> characters.

        In the POSIX locale, only the <space> and <tab> shall be
        included.

    toupper
        Define the mapping of lowercase letters to uppercase letters.

        In the POSIX locale, at a minimum, the 26 lowercase characters:

        a b c d e f g h i j k l m n o p q r s t u v w x y z

        shall be mapped to the corresponding 26 uppercase characters:

        A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

    tolower
        Define the mapping of uppercase letters to lowercase letters.

        In the POSIX locale, at a minimum, the 26 uppercase characters:

        A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

        shall be mapped to the corresponding 26 lowercase characters:

        a b c d e f g h i j k l m n o p q r s t u v w x y z

Summary:

• space, cntrl, xdigit, blank are specified exactly. The C locale must
  only use the specified characters. It can't be extended to support
  other characters, since the standard explicitly states this is not
  allowed.
• upper, lower, toupper and tolower specify minimum requirements. It's
  permitted to extend these to support other characters.

LC_COLLATE

The standard specifies a linear incremental sort order from U+0000 to
U+007F. That's strictly required by the standard. There's a lot of
software out there which explicitly switches to the C locale (or just
calls setlocale(LC_COLLATE, "C")) to get a locale-independent,
guaranteed, known sort order. If this were to be changed, a lot of
software would break.

My take on this is that a UTF-8 C locale should extend the ordering so
that it just sorts any UCS codepoint by value (i.e. U+0000 to
U+10FFFF). This extends the existing order cleanly, and I think it
matches expectations of what the C locale provides.

Regarding handling of non-UTF-8 input, I've not tested how it's handled
for regular locales. AFAICT collation works on UCS codepoints, so
invalid sequences would probably already have been discarded during
conversion?

While in an ideal world it would be great if the "C" locale could
provide the same level of UTF-8/UCS support as other "real" UTF-8
locales, the main issue is ensuring that we comply with the letter of
the standards here--unlike every other locale, this one is explicitly
defined to provide certain things.

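To make the LC_COLLATE proposal above a bit more concrete, here is a
rough sketch of what "sort any UCS codepoint by value" means in
practice. It is only an illustration, not how glibc implements or would
implement it: Python is used for brevity, and the helper names
(codepoint_key, utf8_byte_key) and the sample words are made up for this
example. It also shows a property of UTF-8 that makes the extension
cheap: comparing the encoded bytes one by one gives the same order as
comparing codepoints.

```python
import functools
import locale


def codepoint_key(s):
    """Hypothetical helper: sort key made of the string's codepoint values."""
    return [ord(ch) for ch in s]


def utf8_byte_key(s):
    """Hypothetical helper: sort key made of the raw UTF-8 bytes."""
    return s.encode("utf-8")


# Sample data chosen for illustration: ASCII, Latin-1 range, BMP and
# a non-BMP character.
words = ["zebra", "Zebra", "\u00e9clair", "apple", "\u20ac", "\U0001F600", "_x"]

by_codepoint = sorted(words, key=codepoint_key)
by_utf8_bytes = sorted(words, key=utf8_byte_key)

# Bytewise order of UTF-8 equals codepoint order.
print(by_utf8_bytes == by_codepoint)

# What existing software relies on: switching LC_COLLATE to "C" and
# getting plain codepoint order for ASCII, independent of the user's
# locale settings.
locale.setlocale(locale.LC_COLLATE, "C")
ascii_words = ["b", "a", "B", "_"]
by_c_locale = sorted(ascii_words, key=functools.cmp_to_key(locale.strcoll))
print(by_c_locale)
```

With this ordering all uppercase ASCII letters sort before "_", which
sorts before all lowercase letters, and everything outside ASCII follows
in codepoint order. That is exactly the existing U+0000 to U+007F
behaviour with the range extended upwards.
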
The other consideration is that the "C" locale is by definition a
"minimal" locale that provides a bare minimum of functionality; if you
want to use it to do advanced text processing, I think that's probably
outside its scope. If we do want a universally available locale that
does provide this level of service, then we should probably name it
something other than "C", or simply mandate the existence of e.g.
en_US.UTF-8.

I'd really like to see this implemented directly in glibc, since it's
really just a simple modification of the existing hardcoded C locale.
But I think it does require processing input/output as UTF-8 and
enabling some of the encoding-related stuff to correctly support wide
streams etc. I did start hacking on it, but glibc was mostly
undocumented and very complex, so this never got anywhere. I'll have
another attempt sometime, but if you know anyone with more familiarity
with the source (or upstream!), that would probably be a better plan.

As mentioned on IRC, I joined the Austin Group, and when I have time I
will ask them about UTF-8 support in the C locale, and how this can be
implemented in compliance with the standard.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.