Re: mbstoupper or utf8toupper

Andries Brouwer Thu, 06 Jan 2005 06:20:31 -0800

On Wed, Jan 05, 2005 at 10:57:58PM -0500, Michael B Allen wrote:
> Andries Brouwer said:
> > Turkish has i with dot and i without dot,
> > and unsurprisingly the upper case of dotted i is dotted I,
> > the lower case of dotless I is dotless i.
> > Now dotted i and dotless I are in the ASCII range (single UTF-8 byte),
> > while dotless i is U+0131, dotted I is U+0130. Both take two bytes.
> >
> > These are common vowels.
> 
> So you're saying if I do towlower(0x0130) (dotted I) in a Turkish locale
> I'll get 0x0069 (ASCII i)?


Yes. Try a recent glibc system with locale tr_TR or tr_TR.utf8.
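
A rough sketch of how one might check this, in case it helps. The helper
name lower_in_locale is made up for illustration; setlocale() and
towlower() are the standard C calls. It reports failure instead of
guessing when the requested locale is not installed.

```c
#include <locale.h>
#include <string.h>
#include <wctype.h>

/* Hypothetical helper: lowercase wc under the named LC_CTYPE locale.
   Returns 1 and stores the result in *out, or 0 if the locale is not
   available (or the current locale name is too long to save). */
int lower_in_locale(const char *name, wint_t wc, wint_t *out)
{
    char saved[256];
    const char *cur = setlocale(LC_CTYPE, NULL);

    if (cur == NULL || strlen(cur) >= sizeof saved)
        return 0;
    strcpy(saved, cur);            /* later setlocale calls may overwrite cur */

    if (setlocale(LC_CTYPE, name) == NULL)
        return 0;                  /* locale not installed on this system */

    *out = towlower(wc);
    setlocale(LC_CTYPE, saved);    /* restore the caller's locale */
    return 1;
}
```

Calling lower_in_locale("tr_TR.utf8", 0x0130, &r) on a system that has
that locale should then store the value discussed above in r. The exact
locale name is system dependent.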

Of course many programs are buggy because their authors at first
disregard such details, and then there is a lot of mailing list
activity to get things fixed again.

Andries


(For an example of the type of problems: if someone decides to
recognize commands in arbitrary case, and does this by storing
them in English upper case and comparing that with toupper(cmd),
then things fail in a Turkish locale, because there the upper case
of i is dotted I and not ASCII I.)
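
One possible way around this, as a minimal sketch only: compare using
ASCII rules that do not depend on the current locale. The names
ascii_upper and command_matches are invented here, and the code assumes
an ASCII-compatible character set and keywords stored in upper case.

```c
/* Uppercase one byte using ASCII rules only; setlocale() has no effect. */
static unsigned char ascii_upper(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ? (unsigned char)(c - 'a' + 'A') : c;
}

/* Hypothetical helper: returns 1 if cmd equals keyword when ASCII case
   is ignored in cmd, 0 otherwise. keyword is expected in upper case. */
int command_matches(const char *cmd, const char *keyword)
{
    while (*cmd != '\0' && *keyword != '\0') {
        if (ascii_upper((unsigned char)*cmd) != (unsigned char)*keyword)
            return 0;
        cmd++;
        keyword++;
    }
    /* Match only if both strings ended at the same position. */
    return *cmd == '\0' && *keyword == '\0';
}
```

With this, command_matches("quit", "QUIT") gives the same answer in any
locale, since toupper() is never consulted. It is only suitable for
fixed ASCII keywords, not for user-visible text.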

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
