Pim Blokland <pblokland at planet dot nl> wrote:

> Why is there no UTF-24?
>
> See, these MathText characters take up a lot of space. No matter how
> you encode them; UTF-8, UTF-16 or UTF-32; they always are 4 bytes
> long. Now if we had UTF-24, they would only take up 3 bytes.

Yes, but supplementary characters will normally appear in one of two
circumstances:

(1) as part of a small alphabet (e.g. Deseret, Shavian, Osmanya),
    interspersed with spaces and punctuation in the U+00xx range, in
    which case an existing storage format (SCSU) can encode them in
    only 1 byte each plus an initial 3-byte overhead.

(2) as part of a larger set (e.g. math symbols, CJK Extension B),
    interspersed with even more BMP characters, in which case the
    bytes saved on each supplementary character are overwhelmed by the
    bytes squandered on each BMP character.

> And since the Unicode character range is formally defined to run no
> higher than U+10FFFD, which fits in 3 bytes, I see no reason why
> no-one has ever gone to the trouble of defining a 3-byte storage
> method.

Most likely because no modern computer uses a 3-byte (24-bit) internal
processing unit, and because it would be false economy for real-world
Unicode text (see (1) and (2) above).

> Implementation would be easy; there would be only two variants,
> UTF-24LE and UTF-24BE, and that's it.

I agree, it certainly was easy to implement.  (Oops, did I say that out
loud?)

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
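
To put rough numbers on point (2), here is a minimal Python sketch that
compares byte counts for a short, mostly-BMP string containing one
supplementary math symbol. UTF-24 is not a defined encoding form, so
utf24_len() below is a hypothetical stand-in that simply assumes a fixed
3 bytes per code point; the names utf24_len and sizes are illustrative
only. SCSU is left out because the Python standard library has no codec
for it.

```python
def utf24_len(text):
    """Byte count under a HYPOTHETICAL fixed-width UTF-24 (3 bytes per code point)."""
    return 3 * len(text)


def sizes(text):
    """Return the encoded size of text, in bytes, for each encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        # The -be variants are used so that no byte order mark is counted.
        "UTF-16": len(text.encode("utf-16-be")),
        "UTF-24": utf24_len(text),  # assumption, not a real encoding
        "UTF-32": len(text.encode("utf-32-be")),
    }


# One supplementary character on its own (U+10400, a Deseret letter).
lone = "\U00010400"

# Mostly ASCII text with a single supplementary math symbol (U+1D465).
mixed = "Let \U0001D465 be a real number."

if __name__ == "__main__":
    for label, sample in (("lone", lone), ("mixed", mixed)):
        print(label, sizes(sample))
```

For the lone character, the hypothetical UTF-24 does save one byte (3
versus 4 in every other form). For the mixed string it comes out larger
than both UTF-8 and UTF-16, because every BMP character now costs 3
bytes too.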