To answer some of my own questions and elaborate...

On Tue, Feb 07, 2006 at 02:01:00AM -0500, Rich Felker wrote:
> 1. Should all Mn/Mc (modifier nonspacing/combining) characters be in
> class alpha?
>
> Most certainly _some_ of them need to be, since otherwise [:alpha:]+
> won't match even a whole word in most South Asian scripts, and of
> course these scripts won't be allowed in contexts where only
> alphanumeric characters are valid. One problem with no easy solution
> is that the initial character of an alphanumeric data item should be
> restricted to noncombining characters for most applications, but the
> ctype system has no means to enforce this without introducing new
> types (although wcwidth could be used).

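To make the quoted problem concrete, here is a rough sketch. It is
written in Python rather than C and does not call any real ctype or
locale API; it only approximates a character class using Unicode
general categories. The helper names (is_letter, is_mark, leading_run)
and the sample word are assumptions chosen purely for illustration.

```python
import unicodedata

# Hypothetical helpers for illustration only; they approximate a
# character class with Unicode general categories and are not part of
# any ctype implementation.

def is_letter(ch):
    """True for the letter categories (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")

def is_mark(ch):
    """True for Mn (nonspacing) and Mc (spacing combining) marks."""
    return unicodedata.category(ch) in ("Mn", "Mc")

def is_letter_or_mark(ch):
    """The class you get if combining marks are folded into alpha."""
    return is_letter(ch) or is_mark(ch)

def leading_run(text, in_class):
    """Length of the longest prefix whose characters all satisfy
    in_class, i.e. what a pattern like [:alpha:]+ would match when
    anchored at the start of text."""
    n = 0
    for ch in text:
        if not in_class(ch):
            break
        n += 1
    return n

# The Devanagari word "Hindi": letters interleaved with combining marks.
WORD = "\u0939\u093f\u0928\u094d\u0926\u0940"

letters_only = leading_run(WORD, is_letter)
letters_and_marks = leading_run(WORD, is_letter_or_mark)

# A string that begins with a bare combining mark (U+093F) is accepted
# in full once marks are part of the class.
starts_with_mark = leading_run("\u093f\u0939", is_letter_or_mark)

print(letters_only, letters_and_marks, len(WORD), starts_with_mark)
```

With letters alone, the run stops as soon as it reaches the first vowel
sign, so the match never covers the whole word. With marks folded in,
the whole word matches, but so does a string whose first character is a
combining mark, which is exactly the case that cannot be excluded
without some further distinction.
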
A few ideas.

Solution 1, a horrible hack, is thankfully forbidden, since alnum is
defined as exactly the union of alpha and digit. This would be to
include all combining characters in the class alnum but not alpha,
which gives the correct semantics for alphanumeric identifier fields
([[:alpha:]][[:alnum:]]*).

Solution 2 would be to exclude combining letters from alpha and have a
separate class mark. Alphabetic names/identifiers would then be
[[:alpha:]][[:alpha:][:mark:]]*. Unfortunately this would require all
applications to support a nonstandard ctype for the purpose of matching
valid names.

Solution 3 is to ignore the fact that an initial combining mark is
somehow bad, and include combining marks directly in class alpha.

There are also variations 2a and 3a, which include _only_ the
alphabetic marks in class alpha. The problem with these is that it's
very difficult to decide which marks are alphabetic (aside from South
Asian scripts with combining letters), because accent marks, etc. can
be used in both alphabetic and nonalphabetic ways. Moreover, it seems
the class 'alpha' already needs to include a great many nonalphabetic
characters anyway, since non-Latin digits are excluded from class
'digit'.

Thus I think it's clear that in either solution 2 or 3, at least the
majority of the combining characters should be included. A question
remains whether there are /purely/ punctuational combining marks that
should be classified as punctuation rather than alphanumeric.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
