On Fri, 31 Dec 2004, Michael B Allen wrote:
> > mbtowc/towupper approach isn't really sufficient -- for example, a case
> > change can alter the length of the string.
>
> Dear god please tell me you're mistaken. Please provide an example?
The classic example is that U+00DF, the German eszett, is a lowercase
letter whose uppercase equivalent is the two-letter group "SS".
Another example is that some precomposed combinations of letter and accent
(e.g. U+0149, apostrophe-n) exist in only one case and must be mapped to a
longer sequence when case changes.
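As a rough illustration of the length change, here is a minimal sketch in
Python 3, whose str.upper() applies the Unicode full case mappings rather
than the one-to-one mapping that towupper is limited to. The helper name
length_change is made up for this example.

```python
def length_change(s):
    """Return (code points before, code points after) uppercasing s."""
    return len(s), len(s.upper())

# U+00DF eszett, U+0149 apostrophe-n, and a word containing eszett.
for s in ["\u00df", "\u0149", "Stra\u00dfe"]:
    before, after = length_change(s)
    print(f"{s!r} -> {s.upper()!r}: {before} -> {after} code points")
```

Any code that sizes its output buffer from the input length would get all
three of these wrong.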
There might also -- I'm not sure -- be some titlecase letter combinations
(combinations of two letters, first uppercase and second lowercase, like
U+01F2) which don't have a full set of single-character lowercase and
uppercase equivalents.
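One way to check this against a given Unicode database is to enumerate the
titlecase letters (general category Lt) and see which of them have case
counterparts longer than one code point. The following is only a sketch in
Python 3; the function name is hypothetical, and the result depends on the
Unicode version bundled with the interpreter.

```python
import sys
import unicodedata

def titlecase_letters_with_long_mappings():
    """List titlecase letters (category Lt) whose upper() or lower()
    is longer than one code point in this interpreter's Unicode data."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) != "Lt":
            continue
        if len(ch.upper()) > 1 or len(ch.lower()) > 1:
            found.append(ch)
    return found

# U+01F2 itself does map to single characters in both directions.
print(hex(ord("\u01f2".upper())), hex(ord("\u01f2".lower())))
for ch in titlecase_letters_with_long_mappings():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```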
The mbtowc/towupper scheme also fails in situations where case mapping is
context-dependent, e.g. the proper lowercase equivalent of a Greek capital
sigma depends on whether it's the last letter in a word or not, and there
are even worse complexities with capital iota (which may or may not turn
into a combining accent, depending on context *and* whether the text is
ancient or modern).
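The sigma case can be seen by comparing whole-string conversion with a
character-at-a-time loop. This sketch assumes Python 3, where str.lower()
on a whole string applies the final-sigma rule (true of CPython 3.3 and
later, as far as I know); the per-character join stands in for an
mbtowc/towlower loop.

```python
# Greek capitals: omicron, delta, omicron, sigma.
word = "\u039f\u0394\u039f\u03a3"

# Whole-string lowercasing can see that the sigma is word-final.
whole = word.lower()

# Character-at-a-time lowercasing has no context to look at.
per_char = "".join(ch.lower() for ch in word)

print([hex(ord(c)) for c in whole])
print([hex(ord(c)) for c in per_char])
```

The two results differ only in the last code point: final sigma U+03C2 in
the first, ordinary small sigma U+03C3 in the second.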
> > ...more context: why do you want to do this, as part of what?
>
> I just want to upcase or downcase a string.
Alas, that is *not* nearly as simple a concept as one might think. Yet
another issue is that case mappings are slightly language-dependent -- in
English, the lowercase of U+0049 "I" is U+0069 "i", but in Turkish it's
U+0131 dotless-i -- and also style-dependent -- e.g. as I understand it,
accents in monotonic Greek *may* disappear on conversion to uppercase,
depending on the user's preferred style. So even if you convert a whole
string at a time, dealing with the problems noted above, the correct
case counterpart of a string can be context-specific. Worse, a single
locale isn't sufficient context: consider a Turkish text with embedded
English words!
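To make the mixed-language problem concrete, here is a sketch in Python 3,
whose built-in str.lower() is locale-independent. The lower_tr helper is
hypothetical and handles only the two Turkish I mappings; it is not a
complete Turkish tailoring.

```python
def lower_tr(text):
    """Hypothetical Turkish-style lowercasing: I -> dotless i (U+0131),
    dotted capital I (U+0130) -> i. Everything else uses the default."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()

# An English word next to a Turkish one.
text = "LINUX \u0130STANBUL"

print(ascii(text.lower()))    # default mapping for the whole string
print(ascii(lower_tr(text)))  # Turkish mapping for the whole string
```

Neither call gets both words right: the default mapping mishandles the
dotted capital I, and the Turkish mapping puts a dotless i into "linux".
Getting it right needs language tagging finer than a per-process locale.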
Henry Spencer
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/