Assume you had no legacy code, and no "handy" libraries either, except for the byte libraries in C (string.h, stdlib.h). Just a C++ compiler, a "blank page" to draw on, and a requirement to do a lot of Unicode text processing.

Apart from that, the real world would still apply: you may want to reuse this code in the future, and you may want to use it from within other environments, but even those environments wouldn't have legacy code or existing Unicode libraries.

Obviously, if you had existing libraries that made things "easier" (or so you hope), or legacy code that used UCS-2, things would be different, so we're ignoring those cases. It's all fresh.

What would be the nicest UTF to use?

I think UTF-8 would be the nicest UTF.

Some people say that UTF-32 gives simpler, better code from an architectural viewpoint. After all, every character is one variable in RAM.

But does UTF-32 really offer simpler, better, faster, cleaner code?

A Unicode "character" can be decomposed. Meaning that a character could still be a few variables of UTF32 code points! You'll still need to carry around "strings" of characters, instead of characters.

The fact that it's so thoroughly bloated isn't so great, either. Bloat mongers aren't your friend.

The fact that UTF-32 is incompatible with existing byte-oriented code doesn't help, either. UTF-8 can be used with the existing byte libraries just fine. And the "problem" of a decomposed character taking multiple code units is no worse in UTF-8, because, as above, we still have decomposed characters in UTF-32.

An accented A, decomposed, is 3 bytes in UTF-8: 1 byte for the 'A' plus 2 bytes for the combining acute accent (U+0301). In UTF-32, those same two code points are 8 bytes!

Also, UTF-8 is a great file format and socket-transfer format. Not needing to convert is great. It's also compatible with C strings.
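For instance, here's a quick sketch of the plain string.h calls working directly on UTF-8 (the hex escapes are just the UTF-8 bytes of "naïve café"):

    #include <stdio.h>
    #include <string.h>

    int main()
    {
        /* UTF-8 strings are ordinary NUL-terminated byte strings, so the
           plain C byte libraries work on them unchanged. */
        const char *text = "na\xC3\xAFve caf\xC3\xA9";   /* "naïve café" */

        /* strlen counts bytes (10 characters, 12 bytes here), which is
           exactly what you want for buffer sizes. */
        printf("bytes: %u\n", (unsigned)strlen(text));

        /* strstr works too: one character's UTF-8 encoding can never occur
           inside another's, so a byte-level search can't match in the
           middle of a multi-byte sequence. */
        const char *hit = strstr(text, "caf\xC3\xA9");
        printf("\"cafe\" found at byte offset %d\n", (int)(hit - text));
        return 0;
    }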

Also, UTF-8 has no endianness issues: it's defined as a sequence of bytes, so there's no byte order to get wrong.
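A sketch to make that concrete; the UTF-16 bytes it prints depend on whichever machine runs it, which is exactly the problem:

    #include <stdint.h>
    #include <stdio.h>

    int main()
    {
        /* U+00E9 ('e' with acute) is the single UTF-16 unit 0x00E9, but its
           bytes in a file depend on who wrote it: E9 00 on little-endian,
           00 E9 on big-endian. Hence BOMs and the UTF-16LE/BE variants. */
        uint16_t unit = 0x00E9;
        const unsigned char *b = (const unsigned char *)&unit;
        printf("UTF-16 bytes on this machine: %02X %02X\n", b[0], b[1]);

        /* The UTF-8 encoding is defined as a byte sequence, identical on
           every machine: C3 A9. */
        const unsigned char utf8[] = { 0xC3, 0xA9 };
        printf("UTF-8 bytes on every machine:  %02X %02X\n", utf8[0], utf8[1]);
        return 0;
    }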

Also, UTF-8's compactness makes it great for processing large volumes of text.

I think that UTF-16 is really bad. UTF-16 is popular basically because so many people thought UCS-2 was the answer to internationalisation. UTF-16 was a kind of "bait and switch" (unintentional, of course). Had it been known from the start that we'd need to treat characters as multiple variable-width units anyway, we might as well have gone straight to UTF-8!
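Here's a sketch of why: any character outside the Basic Multilingual Plane takes two 16-bit units in UTF-16 (a surrogate pair), so UTF-16 code has to handle variable-width characters just like UTF-8 code does. Take U+1D11E, the musical G clef:

    #include <stdint.h>
    #include <stdio.h>

    int main()
    {
        /* U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so UTF-16
           needs a surrogate pair: two 16-bit units for one code point. */
        const uint16_t utf16[] = { 0xD834, 0xDD1E };  /* high, low surrogate */

        /* Recombining the pair into the code point: exactly the kind of
           multi-unit handling that UCS-2 code never expected to need. */
        uint32_t cp = 0x10000u
                    + ((uint32_t)(utf16[0] - 0xD800u) << 10)
                    + (uint32_t)(utf16[1] - 0xDC00u);
        printf("U+%04X from %u UTF-16 units\n",
               (unsigned)cp, (unsigned)(sizeof utf16 / sizeof utf16[0]));
        return 0;
    }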

The people who like UTF-16 because UTF-8 takes 3 bytes where UTF-16 takes 2 for their favourite language... I can see their point. But even then, with the prevalence of markup and of 1-byte punctuation, the trade-off is really quite small. UTF-8 (byte) processing code is also more compatible with SCSU, the Standard Compression Scheme for Unicode.

I think with SCSU, the text would be even smaller than UTF-16. And SCSU-encoded text can be processed as markup (XML, for example) with no decompression, so it's quite handy.

Anyhow, that's why I think UTF-8 is really the way to go.

It's too bad Microsoft and Apple didn't realise the same before they made their silly UCS-2 APIs.

--
   Theodore H. Smith - Software Developer - www.elfdata.com/plugin/
   Industrial strength string processing code, made easy.
   (If you believe that's an oxymoron, see for yourself.)



