Philippe stated the following, which I need to correct:

> UTF-24 already exists as an encoding form (it is identical to UTF-32), if 
> you just consider that encoding forms just need to be able to represent a 
> valid code range within a single code unit.

This is false.

Unicode encoding forms exist by virtue of the establishment of
them as standard, by actions of the standardizing organization,
the Unicode Consortium.

> UTF-32 is not meant to be restricted on 32-bit representations.

This is false. The definition of UTF-32 is:

  "The Unicode encoding form which assigns each Unicode scalar
   value to a single unsigned 32-bit code unit with the same
   numeric value as the Unicode scalar value."
   
It is true that UTF-32 could be (and is) implemented on computers
which hold 32-bit numeric types transiently in 64-bit registers
(or registers of other sizes), but if an array of 64-bit integers
(or 24-bit integers) were handed to some API and claimed to be
UTF-32, it would simply be nonconformant to the standard.

UTF-24 does not "already exist as an encoding form" -- it already
exists as one of a large number of more or less idle speculations
by character numerologists regarding other cutesy ways to handle
Unicode characters on computers. Many of those cutesy ways are
mere thought experiments or even simply jokes.

> However it's true that UTF-24BE and UTF-24LE could be useful as a encoding 
> schemes for serializations to byte-oriented streams, suppressing one 
> unnecessary byte per code point.

"Could be", perhaps, but is not.

Implementers who use UTF-32 for processing efficiency, but who
have bandwidth constraints in some streaming context, should
simply use one of the character encoding schemes (CESes) with
better size characteristics, or compress their data.
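
For example, an implementation that holds UTF-32 internally can
serialize to UTF-8 at the point of output. A minimal sketch of
such a conversion (my own illustration; the function name and
interface are made up for the example):

    #include <stddef.h>
    #include <stdint.h>

    /* Encode one Unicode scalar value as UTF-8.  Returns the number
       of bytes written (1 to 4), or 0 if cp is not a scalar value. */
    static size_t scalar_to_utf8(uint32_t cp, uint8_t out[4])
    {
        if (cp < 0x80) {
            out[0] = (uint8_t)cp;
            return 1;
        }
        if (cp < 0x800) {
            out[0] = (uint8_t)(0xC0 | (cp >> 6));
            out[1] = (uint8_t)(0x80 | (cp & 0x3F));
            return 2;
        }
        if (cp < 0x10000) {
            if (cp >= 0xD800 && cp <= 0xDFFF)
                return 0;                      /* surrogate code point */
            out[0] = (uint8_t)(0xE0 | (cp >> 12));
            out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (uint8_t)(0x80 | (cp & 0x3F));
            return 3;
        }
        if (cp <= 0x10FFFF) {
            out[0] = (uint8_t)(0xF0 | (cp >> 18));
            out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (uint8_t)(0x80 | (cp & 0x3F));
            return 4;
        }
        return 0;                              /* beyond U+10FFFF */
    }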

> Note that 64-bit systems could do the same: 3 code points per 64-bit unit, 
> requires only 63 bits, that are stored in a single positive 64-bit integer 
> (the remaining bit would be the sign bit, always set to 0, avoiding problems 
> related to sign extensions). And even today's system could use such 
> representation as well, given that most 32-bit processors of today also have 
> the internal capabilities to manage 64-bit integers natively.

This is just an incredibly bad idea.

Packing instructions in large-word microprocessors is one thing. You
have built-in microcode which handles that, hidden away from
application-level programming, and carefully architected for
maximal processor efficiency.

But attempting to pack character data into microprocessor words, just
because you have bits available, would just detract from the efficiency
of handling that data. Storage is not the issue -- you want to
get the characters in and out of the registers as efficiently as
possible. UTF-32 works fine for that. UTF-16 works almost as well,
in aggregate, for that. And I couldn't care less that when U+0061
goes into a 64-bit register for manipulation, the high 57 bits are
all set to zero.
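
For what it is worth, here is the kind of bit-twiddling every
character access would pay under such a packed scheme. This is a
purely illustrative sketch of one possible layout (3 x 21 bits per
64-bit unit, with an out-of-range value such as 0x1FFFFF marking
unused slots, per the proposal being quoted):

    #include <stdint.h>

    #define PACK_FILLER 0x1FFFFFu  /* out-of-range value for an unused slot */

    /* Extract code point i (0, 1, or 2) from a packed 64-bit unit.
       Every single character access now costs a shift, a mask, and
       a filler check before any real work on the character begins. */
    static uint32_t unpack_cp(uint64_t unit, int i)
    {
        return (uint32_t)((unit >> (21 * i)) & 0x1FFFFFu);
    }

All of that work is overhead on exactly the operation you want to
be cheapest: getting characters in and out of registers.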

> Strings could be encoded as well using only 64-bit code units that would 
> each store 1 to 3 code points, 

Yes, and pigs could fly, if they had big enough wings.

> the unused positions being filled with 
> invalid codepoints out the Unicode space (for example by setting all 21 bits 
> to 1, producing the out-of-range code point 0x1FFFFF, used as a filler for 
> missing code points, notably when the string to encode is not an exact 
> multiple of 3 code points). Then, these 64-bit code units could be 
> serialized on byte streams as well, multiplying the number of possibilities: 
> UTF-64BE and UTF-64LE? One interest of such scheme is that it would be more 
> compact than UTF-32, because this UTF-64 encoding scheme would waste only 1 
> bit for 3 codepoints, instead 1 byte and 3 bits for each codepoint with 
> UTF-32!

Wow!

> You can imagine many other encoding schemes, depending on your architecture 
> choices and constraints...

Yes, one can imagine all sorts of strange things. I myself
imagined UTF-17 once. But there is a difference between having
fun imagining strange things and filling the list with
confusing misinterpretations of the status and use of
UTF-8, UTF-16, and UTF-32.

--Ken

