Re: unicode classes vs c/posix ctype classes

Rich Felker Tue, 07 Feb 2006 10:49:18 -0800

On Tue, Feb 07, 2006 at 01:30:27PM +0100, Bruno Haible wrote:
> Rich Felker wrote:
> > I'm trying to decide on the correct way to assign ctype classes to
> > UCS, and not sure if there's any consensus on the correct way.
> 
> There's certainly some amount of judgement involved. The way it's done
> in glibc is found in glibc/localedata/gen-unicode-ctype.c.


OK, thanks, I'll have a look. BTW, is there a reason you used C for
this rather than a several-line sed script to apply the 'corrections'
to UnicodeData.txt?

Also, glibc's isalpha looks badly wrong to me for Indic scripts. Because
combining letters (vowel signs and the like) are excluded, nearly every
word in most South Asian languages contains characters that are
nonalphabetic under the rules as I read them. The only script handled
correctly is Thai, where you've included exceptions.
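
To make the complaint concrete, here is a minimal sketch. It does not use
glibc's tables; the categories are hardcoded from UnicodeData.txt for one
Hindi word, and the names (ucd_entry, alpha_letters_only, alpha_with_marks,
count_alpha) are hypothetical. It compares a letters-only rule with a rule
that also accepts combining marks (Mn, Mc):

```c
#include <stddef.h>
#include <string.h>

/* One code point and its General_Category as listed in UnicodeData.txt. */
struct ucd_entry {
    unsigned long cp;
    const char *category;
};

/* The Hindi word "Hindi": U+0939 U+093F U+0928 U+094D U+0926 U+0940. */
const struct ucd_entry hindi_word[] = {
    { 0x0939, "Lo" }, /* DEVANAGARI LETTER HA */
    { 0x093F, "Mc" }, /* DEVANAGARI VOWEL SIGN I */
    { 0x0928, "Lo" }, /* DEVANAGARI LETTER NA */
    { 0x094D, "Mn" }, /* DEVANAGARI SIGN VIRAMA */
    { 0x0926, "Lo" }, /* DEVANAGARI LETTER DA */
    { 0x0940, "Mc" }, /* DEVANAGARI VOWEL SIGN II */
};
const size_t hindi_word_len = sizeof hindi_word / sizeof hindi_word[0];

/* Rule 1 (illustrative): only the letter categories Lu, Ll, Lt, Lm, Lo. */
int alpha_letters_only(const char *cat)
{
    return cat[0] == 'L';
}

/* Rule 2 (illustrative): letters plus combining marks Mn and Mc. */
int alpha_with_marks(const char *cat)
{
    return cat[0] == 'L' || strcmp(cat, "Mn") == 0 || strcmp(cat, "Mc") == 0;
}

/* Count how many code points of the word a given rule accepts as alpha. */
size_t count_alpha(const struct ucd_entry *word, size_t len,
                   int (*rule)(const char *))
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (rule(word[i].category))
            n++;
    return n;
}
```

Under the letters-only rule, count_alpha accepts only three of the six code
points, so a scanner that splits on non-alpha characters would break the
word apart. Under the second rule all six are accepted.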

> > Lu -> upper
> > Ll -> lower
> 
> I think this needs to take into account the towupper and towlower mappings.

I don't see how this is so. Classifying a character as upper/lower is
much more general than case mappings, since some relationships cannot
be represented with case mappings. I don't see anywhere that ISO C or
POSIX requires toupper to change a character in order for that
character to be considered lowercase, or vice versa.
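
As an illustration of the distinction, the sketch below reads records in
UnicodeData.txt format and checks the general category (field 2)
separately from the simple uppercase mapping (field 12). The helper names
(ucd_field, is_lower_by_category, has_simple_upper) are hypothetical, and
the sample records are written from memory of the file format, so treat
them as assumptions:

```c
#include <stddef.h>
#include <string.h>

/* Sample records in UnicodeData.txt layout (semicolon-separated fields). */
const char *const sample_sharp_s =
    "00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;";
const char *const sample_small_a =
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041";

/* Copy field number `index` (0-based) of a record into `out`.
   Returns 0 on success, -1 if the field is missing or does not fit. */
int ucd_field(const char *line, int index, char *out, size_t outsz)
{
    const char *start = line;
    for (int i = 0; i < index; i++) {
        start = strchr(start, ';');
        if (start == NULL)
            return -1;
        start++; /* step past the separator */
    }
    size_t len = strcspn(start, ";\n");
    if (len >= outsz)
        return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}

/* Lowercase by classification alone: general category is Ll. */
int is_lower_by_category(const char *line)
{
    char cat[8];
    return ucd_field(line, 2, cat, sizeof cat) == 0 && strcmp(cat, "Ll") == 0;
}

/* True if the record carries a simple uppercase mapping (field 12). */
int has_simple_upper(const char *line)
{
    char upper[16];
    return ucd_field(line, 12, upper, sizeof upper) == 0 && upper[0] != '\0';
}
```

For the sharp s record, is_lower_by_category is true while has_simple_upper
is false, which is the kind of case where requiring a case mapping would
drop a character that is clearly lowercase.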

> > Lt -> alpha
> > Lm -> alpha
> > Lo -> alpha
> 
> There are a couple of special cases to be considered here.

...like? Just the exceptions for Thai?

> > Nd -> digit
> 
> If you do that, the resulting locale is not ISO C 99 compliant.

Thanks for this info. I guess that answers the question. :)
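
A minimal sketch of what this constraint means for a table generator:
only U+0030..U+0039 may land in digit. Where the remaining Nd characters
go is a judgement call; placing them in alpha, so that they still count as
alnum, is an assumption made for this example and not something settled
in this thread. The names CLASS_DIGIT, CLASS_ALPHA and
classify_decimal_digit are hypothetical:

```c
/* Illustrative class bits; not taken from any real libc. */
enum {
    CLASS_DIGIT = 1,
    CLASS_ALPHA = 2
};

/* Classify a code point whose general category is Nd.
   ISO C 99 limits the digit class to the ten ASCII decimal digits, so
   every other Nd character needs a different home. Here (assumption)
   they are treated as alpha. */
unsigned classify_decimal_digit(unsigned long cp)
{
    if (cp >= 0x0030 && cp <= 0x0039)
        return CLASS_DIGIT;
    return CLASS_ALPHA;
}
```

With this rule U+0037 (DIGIT SEVEN) is a digit, while U+0663
(ARABIC-INDIC DIGIT THREE) is not.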

> > Zs -> space
> > Zl -> space
> > Zp -> space
> 
> U+00A0 shouldn't be treated like a space.

Yes, I asked about that below. Thanks for the answer. Are there other
'space' characters that should not be treated as a space?
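
One mechanical way to look for other candidates, offered as a sketch
rather than an answer from this thread: U+00A0 carries a <noBreak>
decomposition tag in UnicodeData.txt, and so do U+2007 (FIGURE SPACE) and
U+202F (NARROW NO-BREAK SPACE). The hypothetical function below takes the
category and decomposition fields of a record and rejects separators with
that tag:

```c
#include <string.h>

/* Decide whether a character belongs in the space class, given its
   General_Category and decomposition field from UnicodeData.txt.
   Assumption: a separator tagged <noBreak> is not treated as a space. */
int is_space_class(const char *category, const char *decomposition)
{
    int separator = strcmp(category, "Zs") == 0 ||
                    strcmp(category, "Zl") == 0 ||
                    strcmp(category, "Zp") == 0;
    int no_break = strstr(decomposition, "<noBreak>") != NULL;
    return separator && !no_break;
}
```

For example, is_space_class("Zs", "<noBreak> 0020") is 0 (the U+00A0 case),
while is_space_class("Zs", "<compat> 0020") is 1 (U+2003 EM SPACE). The
ASCII whitespace controls such as tab and newline are category Cc and
would have to be added separately.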

> > Cf -> cntrl (???)
> 
> I wouldn't do so. Many programs use iscntrl() as a test whether to drop
> a character from the output. Cf class characters shouldn't be dropped.

Good point. Then should they be printable but non-graphic? Or totally
unclassified?

> > 3. Should characters other than ASCII space and tab be included in
> >    blank?
> 
> This is a muddy area.

:)

> > 4. Are Me (enclosing mark) characters ever used for actual
> >    alphabetic purposes (spelling words/names), or just nonsense like
> >    drawing circles around letters?
> 
> There is also:
> 0488;COMBINING CYRILLIC HUNDRED THOUSANDS SIGN;Me;0;NSM;;;;;N;;;;;
> 0489;COMBINING CYRILLIC MILLIONS SIGN;Me;0;NSM;;;;;N;;;;;

These are non-alphabetic, right?

> 06DE;ARABIC START OF RUB EL HIZB;Me;0;NSM;;;;;N;;;;;

This seems to be just an annotation mark, but it's grouped among other
annotation marks of other combining classes so I suppose it would be
bad to treat them differently.


Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
