On Fri, 31 Dec 2004, Michael B Allen wrote:
> > mbtowc/towupper approach isn't really sufficient -- for example, a case
> > change can alter the length of the string.
> 
> Dear god please tell me you're mistaken. Please provide an example?

The classic example is that U+00DF, the German eszett, is a lowercase
letter whose uppercase equivalent is the two-letter group "SS".

Another example is that some precomposed combinations of letter and accent
(e.g. U+0149, apostrophe-n) exist in only one case and must be mapped to a
longer sequence when case changes. 
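
A minimal sketch of both cases, using Python purely for illustration
(Python is not part of the original discussion; its str.upper() applies
Unicode's full case mappings, unlike a one-to-one towupper-style call):

```python
# Full case mapping can change the number of code points in a string.
eszett = "\u00df"   # LATIN SMALL LETTER SHARP S
n_apos = "\u0149"   # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE

upper_eszett = eszett.upper()   # two code points: "SS"
upper_n_apos = n_apos.upper()   # two code points: U+02BC followed by "N"

# Each input is one code point; each result is two.
print(len(eszett), len(upper_eszett))
print(len(n_apos), len(upper_n_apos))
```

Any buffer sized on the assumption that the output is as long as the
input would be too small for these results.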

There might also -- I'm not sure -- be some titlecase letter combinations
(combinations of two letters, first uppercase and second lowercase, like
U+01F2) which don't have a full set of single-character lowercase and
uppercase equivalents.
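
One way to check a particular code point is to ask a Unicode-aware
library.  The sketch below, again in Python as an illustrative choice,
looks only at U+01F2 and says nothing about other titlecase letters; the
answer reflects the Unicode data shipped with the interpreter:

```python
import unicodedata

dz_title = "\u01f2"   # the titlecase digraph mentioned above

dz_upper = dz_title.upper()
dz_lower = dz_title.lower()

# Print the code point and character name of each variant.
for label, s in (("title", dz_title), ("upper", dz_upper), ("lower", dz_lower)):
    names = ", ".join(f"U+{ord(c):04X} {unicodedata.name(c)}" for c in s)
    print(f"{label}: {names}")
```

For this code point, at least, the mappings come back as the single
characters U+01F1 and U+01F3.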

The mbtowc/towupper scheme also fails in situations where case mapping is
context-dependent, e.g. the proper lowercase equivalent of a Greek capital
sigma depends on whether it's the last letter in a word or not, and there
are even worse complexities with capital iota (which may or may not turn
into a combining accent, depending on context *and* whether the text is
ancient or modern).
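
The sigma case can be sketched by comparing character-at-a-time
conversion (the analogue of the mbtowc/towupper loop) with whole-string
conversion.  This assumes Python 3.3 or later, whose str.lower()
implements the final-sigma rule; the sample word is only an illustration:

```python
# Greek capitals omicron, delta, omicron, sigma.
word = "\u039f\u0394\u039f\u03a3"

# Character at a time: each letter is lowercased with no view of its
# neighbours, so the sigma cannot know it ends a word.
per_char = "".join(ch.lower() for ch in word)

# Whole string: the converter can see that the sigma is word-final.
whole = word.lower()

print(f"per_char ends with U+{ord(per_char[-1]):04X}")
print(f"whole    ends with U+{ord(whole[-1]):04X}")
```

The two results differ only in the last letter: U+03C3 (ordinary small
sigma) versus U+03C2 (final small sigma).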

> > ...more context:  why do you want to do this, as part of what? 
> 
> I just want to upcase or downcase a string.

Alas, that is *not* nearly as simple a concept as one might think.  Yet
another issue is that case mappings are slightly language-dependent -- in
English, the lowercase of U+0049 "I" is U+0069 "i", but in Turkish it's
U+0131 dotless-i -- and also style-dependent -- e.g. as I understand it,
accents in monotonic Greek *may* disappear on conversion to uppercase,
depending on the user's preferred style.  So even if you convert a whole
string at a time, dealing with the problems noted above, the correct
case counterpart of a string can be context-specific.  Worse, a single
locale isn't sufficient context:  consider a Turkish text with embedded
English words!
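
A rough sketch of the language dependence follows.  The helper
lower_for_language and its "tr"/"az" tags are hypothetical names made up
for this example, not any library's API, and the Turkish handling is
deliberately simplified (it ignores combining dot-above sequences):

```python
def lower_for_language(text: str, lang: str) -> str:
    """Lowercase text, special-casing dotted and dotless i for Turkish.

    Hypothetical helper for illustration only.
    """
    if lang in ("tr", "az"):
        # U+0130 (capital I with dot above) -> "i", then "I" -> U+0131.
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    # Everything else uses the language-neutral mapping.
    return text.lower()

print(lower_for_language("ISPARTA", "en"))
print(lower_for_language("ISPARTA", "tr"))
```

The helper takes one language tag per call, so a Turkish text with
embedded English words would still need each span tagged separately
before conversion.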

                                                          Henry Spencer
                                                       [EMAIL PROTECTED]


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
