Re: UTF-24

Doug Ewell Thu, 03 Apr 2003 23:13:22 -0800

Pim Blokland <pblokland at planet dot nl> wrote:

> Why is there no UTF-24?
>
> See, these MathText characters take up a lot of space. No matter how
> you encode them; UTF-8, UTF-16 or UTF-32; they always are 4 bytes
> long. Now if we had UTF-24, they would only take up 3 bytes.


Yes, but supplementary characters will normally appear in one of two
circumstances:

(1) as part of a small alphabet (e.g. Deseret, Shavian, Osmanya),
interspersed with spaces and punctuation in the U+00xx range, in which
case an existing compression format, SCSU (the Standard Compression
Scheme for Unicode), can encode them in only 1 byte each plus an
initial 3-byte overhead.

(2) as part of a larger set (e.g. math symbols, CJK Extension B),
interspersed with even more BMP characters, in which case the single
byte saved on each supplementary character is overwhelmed by the extra
byte that UTF-24 would squander on each BMP character relative to
UTF-16.
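
To put rough numbers on case (2), here is a minimal Python sketch that
counts bytes for a string under UTF-8, UTF-16, UTF-32 and a
hypothetical fixed-width UTF-24. The function name encoded_sizes and
the sample strings are my own illustration, and "UTF-24" here is only
assumed to mean 3 bytes per code point, since no such form is defined.

```python
def encoded_sizes(text):
    """Return byte counts for `text` in several encoding forms (no BOM)."""
    code_points = len(text)  # a Python 3 str is a sequence of code points
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        "UTF-32": len(text.encode("utf-32-le")),
        # Assumption: hypothetical UTF-24 spends exactly 3 bytes per code point.
        "UTF-24": 3 * code_points,
    }

# Only supplementary characters: U+1D465 MATHEMATICAL ITALIC SMALL X, ten times.
only_supplementary = "\U0001D465" * 10

# Case (2): one supplementary math symbol among ordinary BMP (here ASCII) text.
mixed = "Let " + "\U0001D465" + " be a real number."

for label, sample in (("supplementary only", only_supplementary),
                      ("mixed", mixed)):
    print(label, encoded_sizes(sample))
```

For the ten-symbol string, the hypothetical UTF-24 needs 30 bytes
against 40 for each of the real forms. For the mixed sentence (23 code
points) it needs 69 bytes, against 26 in UTF-8 and 48 in UTF-16.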

> And since the Unicode character range is formally defined to run no
> higher than U+10FFFD, which fits in 3 bytes, I see no reason why
> no-one has ever gone to the trouble of defining a 3-byte storage
> method.

Most likely because no modern computer uses a 3-byte (24-bit) internal
processing unit, and because it would be false economy for real-world
Unicode text (see (1) and (2) above).

> Implementation would be easy; there would be only two variants,
> UTF-24LE and UTF-24BE, and that's it.

I agree, it certainly was easy to implement.  (Oops, did I say that out
loud?)
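
For the curious, a minimal sketch of what such a thing might look like
in Python. This is not a defined encoding form and not necessarily how
anyone has actually implemented it; the names utf24_encode and
utf24_decode are hypothetical, and I am assuming each code point is
simply stored as a 3-byte integer in the chosen byte order.

```python
def utf24_encode(text, byteorder="big"):
    """Encode each code point as exactly 3 bytes.

    byteorder="big" gives the hypothetical UTF-24BE, "little" gives UTF-24LE.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogate code points cannot be encoded")
        out += cp.to_bytes(3, byteorder)
    return bytes(out)


def utf24_decode(data, byteorder="big"):
    """Decode a byte string made of 3-byte units back into text."""
    if len(data) % 3 != 0:
        raise ValueError("input length is not a multiple of 3")
    chars = []
    for i in range(0, len(data), 3):
        cp = int.from_bytes(data[i:i + 3], byteorder)
        # 24 bits can hold values beyond the Unicode range, so check.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("invalid code point at offset %d" % i)
        chars.append(chr(cp))
    return "".join(chars)


sample = "A\U0001D465"  # U+0041 followed by U+1D465
print(utf24_encode(sample, "big").hex())
print(utf24_encode(sample, "little").hex())
print(utf24_decode(utf24_encode(sample, "little"), "little") == sample)
```

Note that a 3-byte unit can represent values above U+10FFFF, so a
decoder still has to range-check every unit.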

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
