Re: mbstoupper or utf8toupper

Larry Wall Fri, 31 Dec 2004 18:53:35 -0800

On Fri, Dec 31, 2004 at 08:09:29PM -0500, Michael B Allen wrote:
: On Fri, 31 Dec 2004 11:50:27 -0500 (EST)
: Henry Spencer <[EMAIL PROTECTED]> wrote:
: 
: > On Fri, 31 Dec 2004, Michael B Allen wrote:
: > > I'm looking for a C function to convert the case of a UTF-8 string.
: > 
: > Bear in mind that doing this right is not a simple exercise, and the
: > mbtowc/towupper approach isn't really sufficient -- for example, a case
: > change can alter the length of the string.
: 
: Dear god please tell me you're mistaken. Please provide an example?


Well, I don't know whether Henry is claiming to be a god these days,
but as usual he's absolutely correct.  I whipped up a little script
to compare UTF-8 lengths of uppercased characters, and it spat out
these mismatches:

    Lower Upper  Len   Chars
    ===== =====  ===   =====
    0131  0049   2 1   ı I
    017F  0053   2 1   ſ S
    1FBE  0399   3 2   ι Ι

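The script itself isn't shown, so what follows is only a rough sketch of
that kind of scan, in Python (an assumption; the thread names no language).
It relies on Python's built-in str.upper(), which applies full case
mappings from whatever Unicode version the interpreter ships, so the
sketch keeps only one-to-one mappings and may report more rows than the
three listed above.  The names utf8_len and length_mismatches are
hypothetical.

```python
import sys


def utf8_len(s):
    """Number of bytes s occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


def length_mismatches(limit=sys.maxunicode + 1):
    """Yield (lower, upper, lower_len, upper_len) for every code point
    below `limit` whose one-to-one uppercase mapping has a different
    UTF-8 length."""
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # lone surrogates cannot be encoded as UTF-8
        lower = chr(cp)
        upper = lower.upper()
        if upper == lower or len(upper) != 1:
            continue  # no mapping, or a one-to-many mapping we skip
        a, b = utf8_len(lower), utf8_len(upper)
        if a != b:
            yield cp, ord(upper), a, b


if __name__ == "__main__":
    print("Lower Upper  Len   Chars")
    print("===== =====  ===   =====")
    for lo, up, a, b in length_mismatches():
        print(f"{lo:04X}  {up:04X}   {a} {b}   {chr(lo)} {chr(up)}")
```

Skipping the one-to-many mappings is a deliberate simplification: they
change the length of the string in characters as well as in bytes, which
only strengthens Henry's point.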
A subtler trap is that even UTF-16 is a variable-length encoding, and,
while the current character definitions won't change UTF-16 length
on case changes, there's no guarantee that characters won't be added
outside the BMP (where UTF-16 needs a surrogate pair) that case-map to
characters inside it.  That consideration would apply to UTF-8 as
well, even without the above.

I do think you're pretty safe with UTF-32, though.  :-)
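
To put numbers on that, here is a small sketch (Python is an assumption;
the helper names are hypothetical) that counts encoded bytes in UTF-8,
UTF-16 and UTF-32 for the three pairs in the table, plus one pair from
outside the BMP.  The extra pair, Deseret U+10428 and U+10400, is chosen
purely for illustration and is not from the thread.

```python
# (lowercase, uppercase) code points; the last pair lies outside the BMP.
PAIRS = [
    (0x0131, 0x0049),
    (0x017F, 0x0053),
    (0x1FBE, 0x0399),
    (0x10428, 0x10400),
]

# The "-le" codecs are used so that no byte-order mark is counted.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_len(s, encoding):
    """Number of bytes s occupies in the given encoding."""
    return len(s.encode(encoding))


def report(pairs=PAIRS):
    """Return one row per pair: [lower_hex, upper_hex, then a
    (lower_bytes, upper_bytes) tuple for each encoding in ENCODINGS]."""
    rows = []
    for lo, up in pairs:
        row = [f"{lo:04X}", f"{up:04X}"]
        for enc in ENCODINGS:
            row.append((encoded_len(chr(lo), enc), encoded_len(chr(up), enc)))
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in report():
        print(*row)
```

Only the UTF-32 column is fixed by construction: a single code point is
always four bytes there, whatever it maps to.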

Larry

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
