RE: UTF-24

2003-04-04 Thread Carl W. Brown
Doug,

 Most likely because no modern computer uses a 3-byte (24-bit) internal
 processing unit, and because it would be false economy for real-world
 Unicode text (see (1) and (2) above).

What would be worse is an implementation like the old IBM System/360 computers, where 
24-bit addresses had to sit on full-word (32-bit) boundaries, so the high-order byte was 
used to store flags and other data.
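
To make the layout concrete, here is a rough sketch in Python of a 24-bit address sharing a 32-bit word with 8 flag bits. The names (ADDRESS_MASK, pack_word, unpack_word) and the flag value are made up for illustration; this is not actual System/360 code.

```python
ADDRESS_MASK = 0x00FFFFFF  # low 24 bits hold the address (illustrative name)


def pack_word(address, flags):
    """Store a 24-bit address and 8 flag bits in one 32-bit word."""
    if not 0 <= address <= ADDRESS_MASK:
        raise ValueError("address does not fit in 24 bits")
    if not 0 <= flags <= 0xFF:
        raise ValueError("flags do not fit in 8 bits")
    return (flags << 24) | address


def unpack_word(word):
    """Return (address, flags) from a packed 32-bit word."""
    return word & ADDRESS_MASK, (word >> 24) & 0xFF


word = pack_word(0x012345, 0x80)
print(hex(word))  # 0x80012345
```

Anything that reads the word has to mask off the high-order byte before using the address, and that is the kind of baggage a packed 24-bit text format would invite.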

Carl

Re: UTF-24

2003-04-03 Thread Markus Scherer
Pim Blokland wrote:
 Why is there no UTF-24?
Well, I once proposed UTF-20...

 See, these MathText characters take up a lot of space. No matter how
 you encode them; UTF-8, UTF-16 or UTF-32; they always are 4 bytes
 long.
True for them alone, in those UTFs. Short of defining another Unicode encoding, there are two 
answers that I can offer you:

1. Such characters are expected to be a minority of the text, I suppose even in math text, because 
such documents contain lots of other characters (punctuation, spaces, digits, regular text) 
that are mostly in the BMP and thus shorter. So whole math documents with some supplementary 
MathText characters will use, on average, fewer than 3 bytes per code point in UTF-8 and UTF-16.

2. If you want compression, use the existing SCSU (UTR #6) and BOCU-1 (UTN #6), or general-purpose 
compression such as bzip2.
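
The averaging argument in (1) is easy to check. The following is only a sketch: the sample sentence is invented, the helper name bytes_per_code_point is my own, and "24-bit" stands for a hypothetical fixed 3-byte encoding that does not exist. It uses U+1D431 and U+1D432 from the Mathematical Alphanumeric Symbols block as the supplementary characters.

```python
def bytes_per_code_point(text):
    """Average encoded size per code point for each encoding form."""
    n = len(text)  # a Python str is a sequence of code points
    return {
        "UTF-8": len(text.encode("utf-8")) / n,
        "UTF-16": len(text.encode("utf-16-le")) / n,  # -le: no BOM added
        "UTF-32": len(text.encode("utf-32-le")) / n,
        "24-bit": 3.0,  # hypothetical fixed-width 3-byte form
    }


bold_x = "\U0001D431"  # MATHEMATICAL BOLD SMALL X
bold_y = "\U0001D432"  # MATHEMATICAL BOLD SMALL Y

# Alone, a supplementary character costs 4 bytes in every UTF.
print(bytes_per_code_point(bold_x))

# Mixed with ordinary text, the average drops well below 3 bytes.
sample = f"Let {bold_x} = 2 and {bold_y} = 3; then {bold_x} + {bold_y} = 5."
for name, avg in bytes_per_code_point(sample).items():
    print(f"{name}: {avg:.2f} bytes per code point")
```

With this sample, 4 of the 36 code points are supplementary, so UTF-8 and UTF-16 both come out under 3 bytes per code point. A document made up almost entirely of supplementary characters would of course tilt the other way.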

Note that this applies only to text interchange; the majority of Unicode-aware software uses 
UTF-16 internally.

Best regards,
markus
--
Opinions expressed here may not reflect my company's positions unless otherwise noted.