> From: Ruslan Zasukhin <[EMAIL PROTECTED]>
> Date: Thu, 03 Aug 2006 18:50:20 +0300
>> From: Marcus Bointon <[EMAIL PROTECTED]>
>> Date: Thu, 3 Aug 2006 13:17:42 +0100
>> I may be wrong here, but I'm fairly sure that the dominant Unicode
>> library (IBM's ICU) is centred around UTF-16. That sounds like a good
>> reason for using it. Generally I've got the impression that UTF-8 is
>> much better for web use as it's more space-efficient, but it's also
> Correction, Marcus.
> UTF-8 is space-efficient only for languages in the Latin script group.
> If you try to store Cyrillic-win or Cyrillic-mac text, which use 1 byte
> per Russian character, in UTF-8, you start to use 2 bytes per character.
> For Japanese, one character that takes 2 bytes in UTF-16
> will take 4 bytes in UTF-8.
Three bytes for CJK, actually, usually. That's still more than 2,
though, so UTF-8 can indeed be worse than UTF-16 for CJK text.
Also, what about spaces? Those take up 1 byte instead of 2, even in
Japanese text. When you start talking about HTML, where all the markup
is ASCII, the space savings really add up.
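You can see these trade-offs directly by encoding the same text both ways (a small sketch; the sample strings are just illustrative):

```python
# Compare the encoded size of the same text in UTF-8 and UTF-16.
# 'utf-16-le' is used so Python doesn't prepend a 2-byte BOM,
# which would skew the per-character comparison.
samples = {
    "English": "hello world",        # ASCII: 1 byte/char in UTF-8
    "Russian": "привет мир",         # Cyrillic: 2 bytes/char in UTF-8
    "Japanese": "日本語のテキスト",   # CJK: 3 bytes/char in UTF-8
}

for name, text in samples.items():
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))
    print(f"{name}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")
```

Note how the ASCII space in the Russian sample costs 1 byte in UTF-8 but 2 in UTF-16, which is exactly why markup-heavy HTML tips the balance toward UTF-8 even for non-Latin content.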
Also, if you really need space savings with Unicode text, you can look
at BOCU: http://www.unicode.org/notes/tn6/tn6-1.html The algorithm is
very simple, and it may be useful for things like databases.
Also, if you are worried about space savings, it might be an idea to
put your text into NFC. My ElfData plugin has some NFC code. Some
Korean syllables can take 3 code points instead of just one when
written with the "conjoining jamo" letters; NFC composes them into
single precomposed syllables.
However, I don't have any native experience of oriental languages, so
I may be missing something big. Like perhaps these letters are quite
rare...
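The jamo case is easy to demonstrate with Python's standard `unicodedata` module (a minimal sketch, not the ElfData plugin's code; the sample syllable is just illustrative):

```python
import unicodedata

# The Korean syllable 한 written as three conjoining jamo:
# U+1112 (HIEUH) + U+1161 (A) + U+11AB (NIEUN).
decomposed = "\u1112\u1161\u11ab"

# NFC composes the jamo sequence into one precomposed syllable, U+D55C.
composed = unicodedata.normalize("NFC", decomposed)

# Each jamo is 3 bytes in UTF-8, so the decomposed form costs 9 bytes;
# the single composed code point costs only 3.
print(len(decomposed), len(decomposed.encode("utf-8")))  # 3 code points, 9 bytes
print(len(composed), len(composed.encode("utf-8")))      # 1 code point, 3 bytes
```

So for text that arrives in decomposed form, normalizing to NFC before storage can cut the Hangul portion to a third of its UTF-8 size.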
> So UTF-8 is good only for a small set of languages.
Good for all languages in my opinion :)
--
http://elfdata.com/plugin/