On Fri, Dec 31, 2004 at 08:09:29PM -0500, Michael B Allen wrote:
: On Fri, 31 Dec 2004 11:50:27 -0500 (EST)
: Henry Spencer <[EMAIL PROTECTED]> wrote:
:
: > On Fri, 31 Dec 2004, Michael B Allen wrote:
: > > I'm looking for a C function to convert the case of a UTF-8 string.
: >
: > Bear in mind that doing this right is not a simple exercise, and the
: > mbtowc/towupper approach isn't really sufficient -- for example, a case
: > change can alter the length of the string.
:
: Dear god please tell me you're mistaken. Please provide an example?

Well, I don't know whether Henry is claiming to be a god these days,
but as usual he's absolutely correct. I whipped up a little script
to compare UTF-8 lengths of uppercased characters, and it spat out
these mismatches:

Lower  Upper  Len   Chars
=====  =====  ===   =====
0131   0049   2 1   ı I
017F   0053   2 1   ſ S
1FBE   0399   3 2   ι Ι

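A minimal sketch of that kind of comparison, in Python rather than whatever the original script was written in. The names utf8_len and length_mismatches are made up for illustration. It only looks at one-to-one uppercase mappings, and the rows it prints depend on the Unicode data your Python ships with, so a current interpreter will likely list more than the three mismatches above.

```python
import sys


def utf8_len(s):
    """Number of bytes s occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


def length_mismatches():
    """Yield (lower, upper, lower_len, upper_len) for every code point whose
    one-to-one uppercase mapping has a different UTF-8 length."""
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # lone surrogates cannot be encoded as UTF-8
        lower = chr(cp)
        upper = lower.upper()
        # str.upper() applies full case mappings; skip the ones that expand
        # to several characters so we only compare one-to-one pairs.
        if upper == lower or len(upper) != 1:
            continue
        if utf8_len(lower) != utf8_len(upper):
            yield lower, upper, utf8_len(lower), utf8_len(upper)


if __name__ == "__main__":
    print("Lower  Upper  Len")
    for lower, upper, n_lower, n_upper in length_mismatches():
        print(f"{ord(lower):04X}   {ord(upper):04X}   {n_lower} {n_upper}")
```

The practical consequence for a C routine is the same as the table suggests: size the output buffer from the converted text, not from the input length.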
A subtler trap is that even UTF-16 is a variable-length encoding, and,
while the current character definitions won't change UTF-16 length
on case changes, there's no guarantee that characters won't be added
outside the BMP (where UTF-16 needs a surrogate pair) whose case
mappings land inside the BMP. That consideration would apply to UTF-8
as well, even without the above.
I do think you're pretty safe with UTF-32, though. :-)
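
To make the code-unit point concrete, here is a rough sketch that counts code units before and after uppercasing in each encoding. The helper names code_units and upper_lengths are hypothetical, and U+10428 (a Deseret small letter) is picked only as an example of a cased character beyond U+FFFF. It assumes the character has a one-to-one uppercase mapping.

```python
def code_units(s, encoding):
    """Count the code units of s in 'utf-8', 'utf-16' or 'utf-32'."""
    width = {"utf-8": 1, "utf-16": 2, "utf-32": 4}[encoding]
    # The explicit "-le" suffix keeps Python from prepending a byte order mark.
    name = encoding if width == 1 else encoding + "-le"
    return len(s.encode(name)) // width


def upper_lengths(ch):
    """Map each encoding to (units before, units after) uppercasing ch."""
    upper = ch.upper()
    return {
        enc: (code_units(ch, enc), code_units(upper, enc))
        for enc in ("utf-8", "utf-16", "utf-32")
    }


if __name__ == "__main__":
    for ch in ("\u0131", "\U00010428"):
        print(f"U+{ord(ch):04X}", upper_lengths(ch))
```

For U+0131 only the UTF-8 count changes. For U+10428 the uppercase form also lies beyond U+FFFF, so it stays at two UTF-16 units either way.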
Larry
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/