To answer some of my own questions and elaborate...

On Tue, Feb 07, 2006 at 02:01:00AM -0500, Rich Felker wrote:
> 1. Should all Mn/Mc (modifier nonspacing/combining) characters be in
>    class alpha?
> 
> Most certainly _some_ of them need to be, since otherwise [:alpha:]+
> won't match even a whole word in most South Asian scripts, and of
> course these scripts won't be allowed in contexts where only
> alphanumeric characters are valid. One problem with no easy solution
> is that the initial character of an alphanumeric data item should be
> restricted to noncombining characters for most applications, but the
> ctype system has no means to enforce this without introducing new
> types (although wcwidth could be used).

A few ideas.

Solution 1, a horrible hack, is thankfully forbidden, since alnum is
by definition just the union of alpha and digit. The hack would be to
include all combining characters in class alnum but not in alpha,
which would give the correct semantics for alphanumeric identifier
fields ([[:alpha:]][[:alnum:]]*).

Solution 2 would be to exclude combining letters from alpha and add a
separate class 'mark'. Alphabetic names/identifiers would then be
[[:alpha:]][[:alpha:][:mark:]]*. Unfortunately this would require all
applications to support a nonstandard ctype for the purpose of
matching valid names.
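
A minimal sketch of solution 2, in Python rather than C only because
Python ships the Unicode category tables. The helper names are made up
for illustration, and "alpha = general category L*" and "mark = Mn/Mc"
are assumptions standing in for whatever the real class definitions
would be:

```python
import unicodedata

def is_alpha(ch):
    # Assumed stand-in for an 'alpha' class without combining characters:
    # any Unicode letter category (Lu, Ll, Lt, Lm, Lo).
    return unicodedata.category(ch).startswith("L")

def is_mark(ch):
    # Assumed stand-in for the nonstandard 'mark' class: Mn and Mc.
    return unicodedata.category(ch) in ("Mn", "Mc")

def is_name(s):
    # Equivalent of [[:alpha:]][[:alpha:][:mark:]]*
    if not s or not is_alpha(s[0]):
        return False
    return all(is_alpha(ch) or is_mark(ch) for ch in s[1:])

hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"  # the word "Hindi" in Devanagari
print(is_name(hindi))             # base letter first, marks after
print(all(map(is_alpha, hindi)))  # letters-only test on the same word
print(is_name(hindi[1:]))         # same word minus its initial base letter
```

The second call is the [:alpha:]+ case from the quoted text: the word
contains two Mc vowel signs and an Mn virama, so a letters-only class
cannot match it as a whole.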

Solution 3 is to ignore the fact that an initial combining mark is
somehow bad, and include combining marks directly in class alpha.
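
A rough Python sketch of solution 3, again with made-up helper names
and with "letters plus Mn/Mc" as an assumed approximation of the
resulting alpha class, not a definition taken from any real locale:

```python
import unicodedata

def is_alpha3(ch):
    # Assumed solution-3 'alpha': Unicode letters plus Mn/Mc combining marks.
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat in ("Mn", "Mc")

def is_name3(s):
    # Equivalent of [[:alpha:]]+ under that definition.
    return bool(s) and all(is_alpha3(ch) for ch in s)

word = "\u0939\u093f\u0928\u094d\u0926\u0940"  # the word "Hindi" in Devanagari
stray = "\u0301abc"                            # begins with COMBINING ACUTE ACCENT

print(is_name3(word))   # a whole South Asian word is accepted
print(is_name3(stray))  # so is a name with an initial combining mark
```

The second case is the price of this approach: nothing in the class
itself rejects a name whose first character is a combining mark.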

There are also variations 2a and 3a, which include _only_ the
alphabetic marks in class alpha. The problem with these is that it's
very difficult to decide which marks are alphabetic (aside from South
Asian scripts with combining letters), because accent marks, etc. can
be used in both alphabetic and nonalphabetic ways. Moreover, it seems
that class 'alpha' already needs to include a great many nonalphabetic
characters anyway, since non-Latin digits are excluded from class
'digit'.

Thus I think it's clear that in either solution 2 or 3, at least the
majority of the combining characters should be included (in class
mark or class alpha, respectively). The remaining question is whether
there are /purely/ punctuational combining marks that should be
classified as punctuation rather than alphanumeric.

Rich



--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
