I'm trying to decide on the correct way to assign ctype classes to UCS characters, and I'm not sure whether there's any consensus on how to do it. My idea so far is:

    Lu -> upper
    Ll -> lower
    Lt -> alpha
    Lm -> alpha
    Lo -> alpha
    Mn -> alpha
    Mc -> alpha
    Me -> (none) ???
    Nd -> digit
    Nl -> (none)
    No -> (none)
    Zs -> space
    Zl -> space
    Zp -> space
    (only space and tab) -> blank
    Cc -> cntrl
    Cf -> cntrl (???)
    Cs -> (n/a)
    Co -> (none)
    Cn -> (none)
    P?,S? -> punct

The big questions are:

1. Should all Mn/Mc (nonspacing/spacing combining mark) characters be in class alpha? Most certainly _some_ of them need to be, since otherwise [:alpha:]+ won't match even a whole word in most South Asian scripts, and of course these scripts won't be allowed in contexts where only alphanumeric characters are valid. One problem with no easy solution is that the initial character of an alphanumeric data item should be restricted to noncombining characters for most applications, but the ctype system has no means to enforce this without introducing new types (although wcwidth could be used).

2. Should digit characters outside of ASCII 0-9 be classified as digits? My feeling is that in principle they should, but it may cause lots of problems... If they are classified as digits, does this imply that strtol, etc. must accept them? I'm aware that glibc and uClibc both exclude digits other than ASCII 0-9 from the digit class, but this doesn't mean it's the correct behavior.

3. Should characters other than ASCII space and tab be included in blank? My feeling is no, since the 'blank' ctype is intended for parsing fields in text-format data/config files. The 'space' class is more appropriate if you want to use it for word breaking, etc. (This raises another question: should non-breaking space be considered a space character? What about zero-width space, word joiner, etc.?)

4. Are Me (enclosing mark) characters ever used for actual alphabetic purposes (spelling words/names), or just nonsense like drawing circles around letters? If some are needed for alphabetic purposes, I suppose at least those must be included in class alpha.

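For illustration, the tentative mapping above could be sketched roughly as follows. This is only a sketch under stated assumptions, not a libc implementation: it uses Python's unicodedata module as a stand-in for the UCS category data, the names PROPOSED, ctype_classes and is_alpha_word are hypothetical, and it assumes (following the POSIX locale rule) that upper and lower imply alpha. The first-character check in is_alpha_word is one possible way to express the restriction raised in question 1.

```python
import unicodedata

# Hypothetical table mirroring the proposed category -> class mapping.
# An empty set means "(none)" or "(n/a)" in the proposal.
PROPOSED = {
    "Lu": {"upper"}, "Ll": {"lower"},
    "Lt": {"alpha"}, "Lm": {"alpha"}, "Lo": {"alpha"},
    "Mn": {"alpha"}, "Mc": {"alpha"}, "Me": set(),
    "Nd": {"digit"}, "Nl": set(), "No": set(),
    "Zs": {"space"}, "Zl": {"space"}, "Zp": {"space"},
    "Cc": {"cntrl"}, "Cf": {"cntrl"},
    "Cs": set(), "Co": set(), "Cn": set(),
}


def ctype_classes(ch):
    """Return the set of ctype class names the proposal gives to ch."""
    cat = unicodedata.category(ch)
    if cat[0] in "PS":
        # P?,S? -> punct
        classes = {"punct"}
    else:
        classes = set(PROPOSED.get(cat, ()))
    # Only ASCII space and tab are blank in the proposal.
    if ch in " \t":
        classes.add("blank")
    # Assumption: as in POSIX locales, upper and lower imply alpha.
    if classes & {"upper", "lower"}:
        classes.add("alpha")
    return classes


def is_alpha_word(s):
    """True if s is non-empty, every character is alpha, and the first
    character is not a combining mark (categories Mn, Mc, Me)."""
    if not s or unicodedata.category(s[0]).startswith("M"):
        return False
    return all("alpha" in ctype_classes(c) for c in s)


if __name__ == "__main__":
    # A Devanagari word containing Lo, Mc and Mn characters.
    word = "\u0939\u093f\u0928\u094d\u0926\u0940"
    for c in word:
        print(f"U+{ord(c):04X}", unicodedata.category(c), sorted(ctype_classes(c)))
    print(is_alpha_word(word))
```

With this table, the Devanagari word only passes is_alpha_word because Mn and Mc are placed in alpha; removing them from the table makes the same word fail, which is the situation question 1 describes.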
Anyone on this list have strong opinions on these issues, or know of a place where I can find archived discussion, precedents, normative documents on the matter, etc.?

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
