On 11 Nov 2008, at 13:15, Michael Schnell wrote:

OTOH, in this special case, I don't see why the compiler should "normalize" "u¨" to "ü". If the software is supposed to be handling Unicode, the Unicode string "u¨" should be considered a perfectly legal two-code-point sequence consisting of a "u" (a single byte in UTF-8) and a double dot (supposedly two bytes in UTF-8).

Note that I was simplifying. It's not actually "u¨", but "u" followed by the code point meaning "put ¨ on top of the preceding character". In other words, there are (all in UTF-8):

a) "ü": "LATIN SMALL LETTER U WITH DIAERESIS", encoded as $C3 $BC
b) "ü": "LATIN SMALL LETTER U", encoded as $75, followed by "COMBINING DIAERESIS", which is encoded as $CC $88 c) "u¨": "LATIN SMALL LETTER U", encoded as $75, followed by "DIAERESIS", which is encoded as $C2 $A8

If the user wants to handle this as a single "ü", he should write appropriate code for that. Any automation on that is dangerous.

The character combination literally means "ü" in both cases a) and b); it is not up to the user to decide whether or not it means "ü".
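
That equivalence is exactly what Unicode canonical normalization expresses: a) and b) are canonically equivalent, while c) is not. A quick check (again in Python, using its unicodedata module, purely as an illustration):

    import unicodedata

    a = "\u00FC"   # precomposed u with diaeresis
    b = "u\u0308"  # u followed by COMBINING DIAERESIS
    c = "u\u00A8"  # u followed by the spacing DIAERESIS

    print(unicodedata.normalize("NFC", b) == a)  # True: b composes to a under NFC
    print(unicodedata.normalize("NFD", a) == b)  # True: a decomposes to b under NFD
    print(unicodedata.normalize("NFC", c) == a)  # False: U+00A8 is not a combining mark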


Jonas