From: Marcel <[EMAIL PROTECTED]>
Date: Fri, 16 Jun 2006 17:20:33 +0200

Does this mean that there are several codes for the same character, like an "é" in Unicode?

Yes, there are several "codes" for it. I'm a bit loose on the Unicode terminology myself, but I believe the proper term is "canonically equivalent" sequences of code points.

Basically, Unicode has some history behind it. It tried to be all things to all people, and thus we often have two codes for the same accented letter. Some people wanted their favourite letters to remain a single code point, even though the more modern way that Unicode.org suggests is to store the letter and the accent as separate code points. The accent is called a "combining character".

So Unicode added both variants, the separate combining accents and a single code point that equals letter plus accent, to try to please both kinds of users.
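To make that concrete, here is a small sketch in Python (not REALbasic, and nothing to do with my plugin) using the standard unicodedata module. The variable names are just made up for illustration. It shows the two ways of storing "é" and how normalisation maps one onto the other:

```python
import unicodedata

precomposed = "\u00e9"   # "é" as one code point (LATIN SMALL LETTER E WITH ACUTE)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings look the same on screen but are different sequences.
print(precomposed == decomposed)            # False
print(len(precomposed), len(decomposed))    # 1 2

# NFC composes letter + accent into the single code point,
# NFD splits the single code point into letter + accent.
composed_again = unicodedata.normalize("NFC", decomposed)
split_again = unicodedata.normalize("NFD", precomposed)
print(composed_again == precomposed)        # True
print(split_again == decomposed)            # True
```

So a plain string comparison only works after both sides have been normalised to the same form.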

Even if they hadn't done this, we'd still need normalisation code. What if you have a letter with two accents?

Well, if one accent goes above the letter and the other below, the result looks like the same letter whichever accent is stored first.

For example: A + above-accent + below-accent looks the same to the user as: A + below-accent + above-accent.

So, Unicode has specified a canonical order into which combining characters are to be sorted. My UnicodeStuff module has a ReorderCombiners method now :)
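Roughly, the reordering works like the Python sketch below. This is only an illustration of the idea, not the code in my module: the function name reorder_combiners is made up here, and it assumes the text is already decomposed. Each run of combining characters is stable-sorted by its canonical combining class, which unicodedata.combining() reports (0 means "not a combining character"):

```python
import unicodedata

def reorder_combiners(text):
    """Stable-sort each run of combining marks by canonical combining class."""
    out = []
    run = []  # combining marks seen since the last base character
    for ch in text:
        if unicodedata.combining(ch) != 0:
            run.append(ch)
        else:
            # A base character ends the current run of marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

above_first = "A\u0301\u0323"   # A + acute above + dot below
below_first = "A\u0323\u0301"   # A + dot below + acute above

print(above_first == below_first)   # False
print(reorder_combiners(above_first) == reorder_combiners(below_first))   # True
```

The sort has to be stable, because two accents of the same class (say, two above-accents) really do look different depending on their order, so those must be left alone.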

But not for the UTF-8 encoding, right?

The encoding has nothing to do with this. You'll get exactly the same problem, no more and no less, whether you use UTF-8, UTF-16 or UTF-32.
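A quick way to convince yourself, again sketched in Python with made-up variable names: encode both forms of "é" in each encoding and compare. The bytes differ in every encoding, and in every encoding the difference goes away once you decode and normalise:

```python
import unicodedata

precomposed = "\u00e9"   # "é" as one code point
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT

results = {}
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    a = precomposed.encode(codec)
    b = decomposed.encode(codec)
    bytes_differ = a != b
    # Decode again and normalise both sides to NFC before comparing.
    equal_after_nfc = (unicodedata.normalize("NFC", a.decode(codec))
                       == unicodedata.normalize("NFC", b.decode(codec)))
    results[codec] = (bytes_differ, equal_after_nfc)
    print(codec, bytes_differ, equal_after_nfc)
```

The encoding only decides how each code point is turned into bytes. Which code points are in the string is a separate question, and that is the one normalisation deals with.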

This FAQ http://www.unicode.org/faq/normalization.html tries to explain a bit more, but unfortunately it's all written in gobbledygook, so I don't think it'll help ;)

--
http://elfdata.com/plugin/


