unicode classes vs c/posix ctype classes

Rich Felker Mon, 06 Feb 2006 22:52:41 -0800

I'm trying to decide on the correct way to assign ctype classes to
UCS characters, and I'm not sure whether there's any consensus on how
to do it. My idea so far is:


Lu -> upper
Ll -> lower
Lt -> alpha
Lm -> alpha
Lo -> alpha

Mn -> alpha
Mc -> alpha
Me -> (none) ???

Nd -> digit
Nl -> (none)
No -> (none)

Zs -> space
Zl -> space
Zp -> space

(only space and tab) -> blank

Cc -> cntrl
Cf -> cntrl (???)
Cs -> (n/a)
Co -> (none)
Cn -> (none)

P?,S? -> punct
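
As a rough sketch, the table above can be written out as a lookup on
the general category. This uses Python's unicodedata module; the
function name ctype_classes is made up for illustration, and putting
Lu/Ll in alpha as well is an assumption that follows the POSIX rule
that upper and lower characters automatically belong to alpha.

```python
import unicodedata

def ctype_classes(ch):
    """Hypothetical helper: ctype classes for one character, per the table."""
    cat = unicodedata.category(ch)
    classes = set()
    if cat == "Lu":
        classes.update({"upper", "alpha"})   # alpha added by assumption (POSIX rule)
    elif cat == "Ll":
        classes.update({"lower", "alpha"})   # alpha added by assumption (POSIX rule)
    elif cat in ("Lt", "Lm", "Lo", "Mn", "Mc"):
        classes.add("alpha")
    elif cat == "Nd":
        classes.add("digit")
    elif cat in ("Zs", "Zl", "Zp"):
        classes.add("space")
    elif cat in ("Cc", "Cf"):
        classes.add("cntrl")
    elif cat[0] in "PS":
        classes.add("punct")
    # Me, Nl, No, Cs, Co and Cn fall through with no class at all.
    if ch in (" ", "\t"):
        classes.add("blank")                 # only ASCII space and tab
    return classes
```

With this mapping, U+0915 DEVANAGARI LETTER KA (Lo) and U+093E
DEVANAGARI VOWEL SIGN AA (Mc) both come out as alpha, while U+20DD
COMBINING ENCLOSING CIRCLE (Me) gets no class.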

The big questions are:

1. Should all Mn/Mc (nonspacing mark/spacing combining mark)
   characters be in class alpha?

Most certainly _some_ of them need to be, since otherwise [:alpha:]+
won't match even a whole word in most South Asian scripts, and of
course these scripts won't be allowed in contexts where only
alphanumeric characters are valid. One problem with no easy solution
is that the initial character of an alphanumeric data item should be
restricted to noncombining characters for most applications, but the
ctype system has no means to enforce this without introducing new
types (although wcwidth could be used).
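
To make the problem concrete, here is a small sketch using Python's
unicodedata module. The helper names all_alpha and valid_first_char
are hypothetical, and the sample word is just one Hindi example; it
shows that a letters-only alpha class fails on an ordinary word, and
how a separate first-character check might look.

```python
import unicodedata

# The Hindi word "Hindi": HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II
word = "\u0939\u093f\u0928\u094d\u0926\u0940"
categories = [unicodedata.category(c) for c in word]

LETTERS = {"Lu", "Ll", "Lt", "Lm", "Lo"}

def all_alpha(s, include_marks):
    """Hypothetical [:alpha:]+ test, with or without Mn/Mc in alpha."""
    allowed = LETTERS | ({"Mn", "Mc"} if include_marks else set())
    return all(unicodedata.category(c) in allowed for c in s)

def valid_first_char(s):
    """Hypothetical extra rule: reject items that start with any mark (M*)."""
    return bool(s) and not unicodedata.category(s[0]).startswith("M")

print(categories)
print(all_alpha(word, include_marks=False))  # letters only
print(all_alpha(word, include_marks=True))   # letters plus Mn/Mc
```

Half of the characters in this word are marks, so it only matches when
Mn/Mc are included in alpha. The first-character rule has to live
outside the class system, as noted above.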

2. Should digit characters outside of ascii 0-9 be classified as
   digits?

My feeling is that in principle they should, but it may cause lots of
problems... If they are classified as digits, does this imply that
strtol, etc. must accept them?

I'm aware that glibc and uClibc both exclude everything other than
ASCII 0-9 from the digit class, but this doesn't mean it's the correct
behavior.
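
The coupling between the digit class and number parsing can be
sketched as follows. parse_decimal is a hypothetical strtol-like
function, not any libc's behavior; it only illustrates what accepting
all Nd characters would mean, using unicodedata.decimal to get each
digit's value.

```python
import unicodedata

def parse_decimal(s, ascii_only):
    """Hypothetical strtol-like parser: returns (value, chars consumed)."""
    value = 0
    consumed = 0
    for ch in s:
        if ascii_only:
            if not ("0" <= ch <= "9"):
                break
            digit = ord(ch) - ord("0")
        else:
            # Any character with a decimal digit value (category Nd).
            digit = unicodedata.decimal(ch, None)
            if digit is None:
                break
        value = value * 10 + digit
        consumed += 1
    return value, consumed

arabic_indic = "\u0663\u0664"  # ARABIC-INDIC DIGIT THREE, DIGIT FOUR
print(parse_decimal(arabic_indic, ascii_only=False))
print(parse_decimal(arabic_indic, ascii_only=True))
```

If digit includes all of Nd but the parser stays ASCII-only, code that
validates input with the digit class and then converts it would
silently get 0 for such strings.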

3. Should characters other than ASCII space and tab be included in
   blank?

My feeling is no, since the 'blank' ctype is intended for parsing
fields in text-format data/config files. The 'space' class is more
appropriate if you want to use it for word breaking, etc. (This raises
another question: should non-breaking space be considered a space
character? What about zero-width space, word joiner, etc.?)
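
For reference, the general categories of those characters can be
checked directly. This is only a lookup with Python's unicodedata
module; the dictionary name candidates is made up, and the results
reflect whatever Unicode version the Python build ships.

```python
import unicodedata

# Code points mentioned above, plus ASCII space and tab for comparison.
candidates = {
    "SPACE": "\u0020",
    "TAB": "\u0009",
    "NO-BREAK SPACE": "\u00a0",
    "ZERO WIDTH SPACE": "\u200b",
    "WORD JOINER": "\u2060",
}

general_category = {name: unicodedata.category(ch)
                    for name, ch in candidates.items()}

for name, cat in general_category.items():
    print(f"{name}: {cat}")
```

In current Unicode data, non-breaking space is Zs, while zero-width
space and word joiner are Cf, so under the table above the first would
land in space and the other two in cntrl.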

4. Are Me (enclosing mark) characters ever used for actual
   alphabetic purposes (spelling words/names), or just nonsense like
   drawing circles around letters?

If some are needed for alphabetic purposes, I suppose at least those
must be included in class alpha.

Anyone on this list have strong opinions on these issues, or know of a
place where I can find archived discussion, precedents, normative
documents on the matter, etc.?

Rich





--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
