> From: Ruslan Zasukhin <[EMAIL PROTECTED]>
> Date: Thu, 03 Aug 2006 18:50:20 +0300
>> From: Marcus Bointon <[EMAIL PROTECTED]>
>> Date: Thu, 3 Aug 2006 13:17:42 +0100
>> I may be wrong here, but I'm fairly sure that the dominant Unicode
>> library (IBM's ICU) is centred around UTF-16. That sounds like a good
>> reason for using it. Generally I've got the impression that UTF-8 is
>> much better for web use as it's more space-efficient, but it's also
> Correction, Marcus.
> UTF-8 is space-efficient only for languages in the Latin script group.
> If you try to store Cyrillic-win or Cyrillic-mac text, which use 1 byte
> per Russian character, in UTF-8, you start to use 2 bytes per character.
> For Japanese, one character that takes 2 bytes in UTF-16
> will take 4 bytes in UTF-8.
Three bytes for CJK, actually, usually. That's still more than 2,
though, so UTF-8 can indeed be worse than UTF-16 for CJK text.
Also, what about spaces? Those take up 1 byte instead of 2, even in
Japanese text. When you start talking about HTML, where all the markup
is ASCII, the space savings really add up.
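You can see these trade-offs directly by encoding the same text both ways (a small sketch; the sample strings are just illustrative):

```python
# Compare the encoded size of the same text in UTF-8 and UTF-16.
# 'utf-16-le' is used so Python doesn't prepend a 2-byte BOM,
# which would skew the per-character comparison.
samples = {
    "English": "hello world",        # ASCII: 1 byte/char in UTF-8
    "Russian": "привет мир",         # Cyrillic: 2 bytes/char in UTF-8
    "Japanese": "日本語のテキスト",   # CJK: 3 bytes/char in UTF-8
}

for name, text in samples.items():
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))
    print(f"{name}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")
```

Note how the ASCII space in the Russian sample costs 1 byte in UTF-8 but 2 in UTF-16, which is exactly why markup-heavy HTML tips the balance toward UTF-8 even for non-Latin content.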
Also, if you really need space savings with Unicode text, you can look
at BOCU: http://www.unicode.org/notes/tn6/tn6-1.html The algorithm is
very simple, and it may be useful for things like databases.
Also, if you are worried about space savings, it might be an idea to
put your text into NFC. My ElfData plugin has some NFC code. Some
Korean syllables can take 3 code points instead of just one when
written with the "conjoining jamo" letters; NFC composes them into
single precomposed syllables.
However, I don't have any native experience of oriental languages, so
I may be missing something big. Like perhaps these letters are quite
rare...
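The jamo case is easy to demonstrate with Python's standard `unicodedata` module (a minimal sketch, not the ElfData plugin's code; the sample syllable is just illustrative):

```python
import unicodedata

# The Korean syllable 한 written as three conjoining jamo:
# U+1112 (HIEUH) + U+1161 (A) + U+11AB (NIEUN).
decomposed = "\u1112\u1161\u11ab"

# NFC composes the jamo sequence into one precomposed syllable, U+D55C.
composed = unicodedata.normalize("NFC", decomposed)

# Each jamo is 3 bytes in UTF-8, so the decomposed form costs 9 bytes;
# the single composed code point costs only 3.
print(len(decomposed), len(decomposed.encode("utf-8")))  # 3 code points, 9 bytes
print(len(composed), len(composed.encode("utf-8")))      # 1 code point, 3 bytes
```

So for text that arrives in decomposed form, normalizing to NFC before storage can cut the Hangul portion to a third of its UTF-8 size.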
> So UTF-8 is good only for a small set of languages.
Good for all languages in my opinion :)
--
http://elfdata.com/plugin/