Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-07 Thread Kenneth Whistler
Philippe stated, and I need to correct:

 UTF-24 already exists as an encoding form (it is identical to UTF-32), if 
 you just consider that encoding forms just need to be able to represent a 
 valid code range within a single code unit.

This is false.

Unicode encoding forms exist by virtue of the establishment of
them as standard, by actions of the standardizing organization,
the Unicode Consortium.

 UTF-32 is not meant to be restricted to 32-bit representations.

This is false. The definition of UTF-32 is:

  The Unicode encoding form which assigns each Unicode scalar
   value to a single unsigned 32-bit code unit with the same
   numeric value as the Unicode scalar value.
   
It is true that UTF-32 could be (and is) implemented on computers
which hold 32-bit numeric types transiently in 64-bit registers
(or even other size registers), but if an array of 64-bit integers
(or 24-bit integers) were handed to some API claiming to be UTF-32,
it would simply be nonconformant to the standard.
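
For concreteness, here is a minimal C sketch of what that definition
implies (the names are illustrative, not from the standard): the code
unit type is an unsigned 32-bit integer holding a Unicode scalar value,
and an array of 64-bit or 24-bit integers is simply a different datatype.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* A UTF-32 code unit: a single unsigned 32-bit integer whose numeric
     * value is the Unicode scalar value it represents. */
    typedef uint32_t utf32_unit;

    /* A Unicode scalar value is any code point in 0..0x10FFFF that is not
     * a surrogate code point (0xD800..0xDFFF). */
    static bool is_scalar_value(uint32_t v)
    {
        return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
    }

    /* Well-formed UTF-32 is a sequence of such 32-bit code units. Handing
     * this function an array of 64-bit or 24-bit integers, however they
     * are packed, would not be UTF-32. */
    static bool is_wellformed_utf32(const utf32_unit *s, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            if (!is_scalar_value(s[i]))
                return false;
        return true;
    }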

UTF-24 does not already exist as an encoding form -- it already
exists as one of a large number of more or less idle speculations
by character numerologists regarding other cutesy ways to handle
Unicode characters on computers. Many of those cutesy ways are
mere thought experiments or even simply jokes.

 However it's true that UTF-24BE and UTF-24LE could be useful as encoding 
 schemes for serialization to byte-oriented streams, suppressing one 
 unnecessary byte per code point.

Could be, perhaps, but is not.

Implementers using UTF-32 for processing efficiency, but who have
bandwidth constraints in some streaming context, should simply
use one of the CES's with better size characteristics or apply
compression to their data.
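
For instance, a minimal sketch (assuming the input code units are
already valid scalar values) of serializing UTF-32 code units to UTF-8,
the obvious byte-oriented CES when size matters:

    #include <stddef.h>
    #include <stdint.h>

    /* Serialize one Unicode scalar value (assumed valid: <= 0x10FFFF and
     * not a surrogate) to UTF-8. Writes 1..4 bytes and returns the count. */
    static size_t utf8_put(uint32_t cp, uint8_t out[4])
    {
        if (cp <= 0x7F) {
            out[0] = (uint8_t)cp;
            return 1;
        }
        if (cp <= 0x7FF) {
            out[0] = (uint8_t)(0xC0 | (cp >> 6));
            out[1] = (uint8_t)(0x80 | (cp & 0x3F));
            return 2;
        }
        if (cp <= 0xFFFF) {
            out[0] = (uint8_t)(0xE0 | (cp >> 12));
            out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (uint8_t)(0x80 | (cp & 0x3F));
            return 3;
        }
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }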

 Note that 64-bit systems could do the same: 3 code points per 64-bit unit 
 require only 63 bits, which are stored in a single positive 64-bit integer 
 (the remaining bit would be the sign bit, always set to 0, avoiding problems 
 related to sign extension). And even today's systems could use such a 
 representation, given that most of today's 32-bit processors also have 
 the internal capability to manage 64-bit integers natively.

This is just an incredibly bad idea.

Packing instructions in large-word microprocessors is one thing. You
have built-in microcode which handles that, hidden away from
application-level programming, and carefully architected for
maximal processor efficiency.

But attempting to pack character data into microprocessor words, just
because you have bits available, would just detract from the efficiency
of handling that data. Storage is not the issue -- you want to
get the characters in and out of the registers as efficiently as
possible. UTF-32 works fine for that. UTF-16 works almost as well,
in aggregate, for that. And I couldn't care less that when U+0061
goes into a 64-bit register for manipulation, the high 57 bits are
all set to zero.

 Strings could be encoded as well using only 64-bit code units that would 
 each store 1 to 3 code points, 

Yes, and pigs could fly, if they had big enough wings.

 the unused positions being filled with 
 invalid code points outside the Unicode space (for example by setting all 21 bits 
 to 1, producing the out-of-range code point 0x1FFFFF, used as a filler for 
 missing code points, notably when the string to encode is not an exact 
 multiple of 3 code points). Then, these 64-bit code units could be 
 serialized on byte streams as well, multiplying the number of possibilities: 
 UTF-64BE and UTF-64LE? One interest of such a scheme is that it would be more 
 compact than UTF-32, because this UTF-64 encoding scheme would waste only 1 
 bit for 3 code points, instead of 1 byte and 3 bits for each code point with 
 UTF-32!

Wow!
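
For readers trying to picture the scheme being proposed, here is a
minimal sketch of that hypothetical packing (the 21-bit fields and the
0x1FFFFF filler are taken from the description above; this is not a
Unicode encoding form). Note the shifting and masking that every single
character access would then require:

    #include <stdint.h>

    /* Out-of-range value (all 21 bits set) marking unused slots, as in
     * the description above. */
    #define UTF64_FILLER 0x1FFFFFu

    /* Pack up to three 21-bit code points into one unsigned 64-bit unit,
     * passing UTF64_FILLER in unused slots; the top (sign) bit stays 0
     * because only 63 bits are used. */
    static uint64_t utf64_pack(uint32_t a, uint32_t b, uint32_t c)
    {
        return ((uint64_t)a << 42) | ((uint64_t)b << 21) | (uint64_t)c;
    }

    /* Extract slot 0, 1, or 2 from a packed unit. */
    static uint32_t utf64_unpack(uint64_t unit, int slot)
    {
        return (uint32_t)((unit >> (42 - 21 * slot)) & 0x1FFFFF);
    }

Random access to the nth code point then costs a divide and a modulo by 3
on top of the shift and mask -- exactly the kind of overhead that detracts
from processing efficiency.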

 You can imagine many other encoding schemes, depending on your architecture 
 choices and constraints...

Yes, one can imagine all sorts of strange things. I myself
imagined UTF-17 once. But there is a difference between having
fun imagining strange things and filling the list with
confusing misinterpretations of the status and use of
UTF-8, UTF-16, and UTF-32.

--Ken




Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-07 Thread Rick McGowan
 Yes, and pigs could fly, if they had big enough wings.

An 8-foot wingspan should do it. For a picture of said flying pig, see:

http://www.cincinnati.com/bigpiggig/profile_091700.html
http://www.cincinnati.com/bigpiggig/images/pig091700.jpg

Rick



Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-07 Thread Philippe Verdy
From: Kenneth Whistler [EMAIL PROTECTED]
Yes, and pigs could fly, if they had big enough wings.
Once again, this is a creative comment. As if Unicode had to be bound to 
architectural constraints such as the requirement of representing code units 
(which are architectural for a system) only as 16-bit or 32-bit units, 
ignoring the fact that technologies do evolve and will not necessarily keep 
this constraint. 64-bit systems already exist today, and even if they have, 
for now, the architectural capability of handling efficiently 16-bit and 
32-bit code units so that they can be addressed individually, this will 
possibly not be the case in the future.

When I look at the encoding forms such as UTF-16 and UTF-32, they just 
define the value ranges in which code units will be valid, but not 
necessarily their size. You are mixing this with encoding schemes, which is 
what is needed for interoperability, and where other factors such as bit or 
byte ordering are also important in addition to the value range.

I don't see anything wrong if a system is set up so that UTF-32 code units are 
stored in 24-bit or even 64-bit memory cells, as long as they respect and 
fully represent the value range defined in encoding forms, and if the system 
also provides an interface to convert them with encoding schemes to 
interoperable streams of 8-bit bytes.

Are you saying that UTF-32 code units need to be able to represent any 
32-bit value, even if the valid range is limited, for now, to the first 17 
planes?
An API on a 64-bit system that would say that it requires strings being 
stored with UTF-32 would also define how UTF-32 code units are represented. 
As long as the valid range 0 to 0x10FFFF can be represented, this interface 
will be fine. If this system is designed so that two or three code units 
will be stored in a single 64-bit memory cell, no violation will occur in 
the valid range.

More interestingly, there already exist systems where memory is addressable 
by units of 1 bit, and on these systems, a UTF-32 code unit will work 
perfectly if code units are stored at steps of 21 bits of memory. On 64-bit 
systems, the possibility of addressing arbitrary groups of individual bits will 
become an interesting option, notably when handling complex data structures 
such as bitfields, data compressors, bitmaps, ... No more need to use costly 
shifts and masking. Nothing would prevent such a system from offering 
interoperability with 8-bit-byte-based systems (note also that recent memory 
technologies use fast serial interfaces instead of parallel buses, so that 
the memory granularity is less important).

The only cost of bit-addressing is that it requires 3 extra bits of address, 
but within a 64-bit address, this cost seems very low because the globally 
addressable space will still be... more than 2.3*10^18 bytes, much more than 
any computer will manage in a single process for the next century (according 
to Moore's law, which doubles computing capabilities every 3 years). 
Even such a scheme would not limit performance, given that memory caches 
are paged, and these caches are always increasing, eliminating most of the 
costs and problems related to data alignment experienced today on bus-based 
systems.

Other territories are also still unexplored in microprocessors, notably the 
possibility of using non-binary numeric systems (think about optical or 
magnetic systems which could outperform the current electric systems due to 
reduced power and heat caused by currents of electrons through molecular 
substrates, replacing them by shifts of atomic states caused by light rays, 
and the computing possibilities offered by light diffraction through 
crystals). The lowest granularity of information may some day be larger 
than a dual-state bit, meaning that today's 8-bit systems would need 
to be emulated using other numerical systems...
(Note for example that to store the range 0..0x10FFFF, you would need 13 
digits in a ternary system, and to store the range of 32-bit integers, you 
would need 21 ternary digits; memory technologies for such systems may use 
byte units made of 6 ternary digits, so programmers would have the choice 
between 3 ternary bytes, i.e. 18 ternary digits, to store our 21-bit code 
units, or 4 ternary bytes, i.e. 24 ternary digits or more than 34 binary 
bits, to be able to store the whole 32-bit range.)
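
(A quick way to check those two digit counts, for anyone who cares to:)

    #include <stdint.h>
    #include <stdio.h>

    /* Smallest number of base-3 digits needed to represent every value
     * in the range 0..max. */
    static int ternary_digits(uint64_t max)
    {
        int digits = 0;
        uint64_t count = 1;              /* values representable so far */
        while (count <= max) {
            count *= 3;
            digits++;
        }
        return digits;
    }

    int main(void)
    {
        printf("%d\n", ternary_digits(0x10FFFF));    /* prints 13 */
        printf("%d\n", ternary_digits(0xFFFFFFFF));  /* prints 21 */
        return 0;
    }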

Nothing there is impossible for the future (when it will become more and 
more difficult to increase the density of transistors, or to reduce further 
the voltage, or to increase the working frequency, or to avoid the 
inevitable and random presence of natural defects in substrates; escaping 
from the historic binary-only systems may offer interesting opportunities 
for further performance increase).




Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-07 Thread Kenneth Whistler
Philippe continued:

 As if Unicode had to be bound to 
 architectural constraints such as the requirement of representing code units 
 (which are architectural for a system) only as 16-bit or 32-bit units, 

Yes, it does. By definition. In the standard.

 ignoring the fact that technologies do evolve and will not necessarily keep 
 this constraint. 64-bit systems already exist today, and even if they have, 
 for now, the architectural capability of handling efficiently 16-bit and 
 32-bit code units so that they can be addressed individually, this will 
 possibly not be the case in the future.

This is just as irrelevant as worrying about the fact that 8-bit
character encodings may not be handled efficiently by some 32-bit
processors.

 When I look at the encoding forms such as UTF-16 and UTF-32, they just 
 define the value ranges in which code units will be valid, but not 
 necessarily their size. 

Philippe, you are wrong. Go reread the standard. Each of the encoding
forms is *explicitly* defined in terms of code unit size in bits.

  The Unicode Standard uses 8-bit code units in the UTF-8 encoding
   form, 16-bit code units in the UTF-16 encoding form, and 32-bit
   code units in the UTF-32 encoding form.
   
If there is something ambiguous or unclear in wording such as that,
I think the UTC would like to know about it.

 You are mixing this with encoding schemes, which is 
 what is needed for interoperability, and where other factors such as bit or 
 byte ordering are also important in addition to the value range.

I am not mixing it up -- you are, unfortunately. And it is most
unhelpful on this list to have people waxing on, with
apparently authoritative statements about the architecture
of the Unicode Standard, which on examination turn out to be
flat wrong.

 I don't see anything wrong if a system is set up so that UTF-32 code units are 
 stored in 24-bit or even 64-bit memory cells, as long as they respect and 
 fully represent the value range defined in encoding forms, 

Correct. And I said as much. There is nothing wrong with implementing
UTF-32 on a 64-bit processor. Putting a UTF-32 code point into
a 64-bit register is fine. What you have to watch out for is
handing me a 64-bit array of ints and claiming that it is a
UTF-32 sequence of code points -- it isn't.

 and if the system 
 also provides an interface to convert them with encoding schemes to 
 interoperable streams of 8-bit bytes.

No, you have to have an interface which hands me the correct
data type when I declare it uint_32, and which gives me correct
offsets in memory if I walk an index pointer down an array.
That applies to the encoding *form*, and is completely separate
from provision of any streaming interface that wants to feed
data back and forth in terms of byte streams.
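
In C terms, the distinction is roughly this (a sketch, with illustrative
names):

    #include <stddef.h>
    #include <stdint.h>

    /* Encoding form: the API surfaces genuine 32-bit code units, so plain
     * array indexing walks the text one code point (4 bytes) at a time. */
    static uint32_t nth_code_point(const uint32_t *utf32_text, size_t n)
    {
        return utf32_text[n];
    }

    /* Encoding scheme (here UTF-32BE): the same data serialized to a byte
     * stream for interchange, where byte order becomes relevant. */
    static void utf32be_put(uint32_t cp, uint8_t out[4])
    {
        out[0] = (uint8_t)(cp >> 24);
        out[1] = (uint8_t)(cp >> 16);
        out[2] = (uint8_t)(cp >> 8);
        out[3] = (uint8_t)cp;
    }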

 Are you saying that UTF-32 code units need to be able to represent any 
 32-bit value, even if the valid range is limited, for now, to the first 17 
 planes?

Yes.

 An API on a 64-bit system that would say that it requires strings being 
 stored with UTF-32 would also define how UTF-32 code units are represented. 
 As long as the valid range 0 to 0x10FFFF can be represented, this interface 
 will be fine. 

No, it will not. Read the standard.

An API on a 64-bit system that uses an unsigned 32-bit datatype for UTF-32
is fine. It isn't fine if it uses an unsigned 64-bit datatype for
UTF-32.

 If this system is designed so that two or three code units 
 will be stored in a single 64-bit memory cell, no violation will occur in 
 the valid range.

You can do whatever the heck crazy thing you want to do internal
to your data manipulation, but you cannot surface a datatype
packed that way and conformantly claim that it is UTF-32.

 More interestingly, there already exist systems where memory is addressable 
 by units of 1 bit, and on these systems, ...

[excised some vamping on the future of computers]

 Nothing there is impossible for the future (when it will become more and 
 more difficult to increase the density of transistors, or to reduce further 
 the voltage, or to increase the working frequency, or to avoid the 
 inevitable and random presence of natural defects in substrates; escaping 
 from the historic binary-only systems may offer interesting opportunities 
 for further performance increase).

Look, I don't care if the processors are dealing in qubits on
molecular arrays under the covers. It is the job of the hardware
folks to surface appropriate machine instructions that compiler
makers can use to surface appropriate formal language constructs
to programmers to enable hooking the defined datatypes of
the character encoding standards into programming language
datatypes.

It is the job of the Unicode Consortium to define the encoding
forms for representing Unicode code points, so that people
manipulating Unicode digital text representation can do so
reliably using general purpose programming languages with
well-defined textual data constructs.

Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-06 Thread Philippe Verdy
- Original Message - 
From: Arcane Jill [EMAIL PROTECTED]
Probably a dumb question, but how come nobody's invented UTF-24 yet? I 
just made that up, it's not an official standard, but one could easily 
define UTF-24 as UTF-32 with the most-significant byte (which is always 
zero) removed, hence all characters are stored in exactly three bytes and 
all are treated equally. You could have UTF-24LE and UTF-24BE variants, 
and even UTF-24 BOMs. Of course, I'm not suggesting this is a particularly 
brilliant idea, but I just wonder why no-one's suggested it before.
UTF-24 already exists as an encoding form (it is identical to UTF-32), if 
you just consider that encoding forms just need to be able to represent a 
valid code range within a single code unit.
UTF-32 is not meant to be restricted to 32-bit representations.

However it's true that UTF-24BE and UTF-24LE could be useful as encoding 
schemes for serialization to byte-oriented streams, suppressing one 
unnecessary byte per code point.
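
For concreteness, a minimal sketch of what such a (nonstandard,
hypothetical) UTF-24BE serialization would look like, one code point per
three big-endian bytes:

    #include <stdint.h>

    /* Hypothetical "UTF-24BE" encoding scheme: every code point fits in
     * 21 bits (max 0x10FFFF), so serialize it as exactly three bytes,
     * most significant byte first. Not defined by the Unicode Standard. */
    static void utf24be_put(uint32_t cp, uint8_t out[3])
    {
        out[0] = (uint8_t)(cp >> 16);    /* always 0x00..0x10 */
        out[1] = (uint8_t)(cp >> 8);
        out[2] = (uint8_t)cp;
    }

    static uint32_t utf24be_get(const uint8_t in[3])
    {
        return ((uint32_t)in[0] << 16) | ((uint32_t)in[1] << 8) | in[2];
    }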

(And then of course, there's UTF-21, in which blocks of 21 bits are 
concatenated, so that eight Unicode characters will be stored in every 21 
bytes - and not to mention UTF-20.087462841250343, in which a plain text 
document is simply regarded as one very large integer expressed in radix 
1114112, and whose UTF-20.087462841250343 representation is simply that 
number expressed in binary. But now I'm getting /very/ silly - please 
don't take any of this seriously.)  :-)
I don't think that UTF-21 would be useful as an encoding form, but possibly 
as an encoding scheme where the 3 always-zero bits would be stripped, providing a 
tiny level of compression, which would only be justified for transmission over 
serial or network links.

However I do think that such an optimization would have the effect of 
removing the byte alignment on which more powerful compressors rely. If 
you really need more effective compression, use SCSU or apply some deflate 
or bzip2 compression to UTF-8, UTF-16, or UTF-24/32... (there's not much 
difference between compressing UTF-24 or UTF-32 with generic compression 
algorithms like deflate or bzip2).

The UTF-24 thing seems a reasonably sensible question though. Is it just 
that we don't like it because some processors have alignment restrictions 
or something?
There still exist, even today, 4-bit processors and 1-bit processors, 
where the smallest addressable memory unit is smaller than 8 bits. They are 
used for low-cost micro-devices, notably to build automated robots for 
industry, or even for many home/kitchen devices. I don't know whether they 
need Unicode to represent international text, given that they often have a 
very limited user interface, incapable of inputting or outputting text, but who 
knows? Maybe they are used in some mobile phones, or within smart 
keyboards or tablets or other input devices connected to PCs...

There also exist systems where the smallest addressable memory cell is a 
9-bit byte. This is more of an issue here, because the Unicode standard does 
not specify whether encoding schemes (which serialize code points to bytes) 
should set the 9th bit of each byte to 0, or should fill every 8 bits of 
memory, even if this means that the 8-bit bytes of UTF-8 will not be 
synchronized with the memory's 9-bit bytes.

Somebody already introduced UTF-9 in the past for 9-bit systems.
A 36-bit processor could as well address memory by cells of 36 bits, 
where the 4 highest bits would either be used for CRC control bits 
(generated and checked automatically by the processor or a memory bus 
interface within memory regions where this behavior would be allowed), or 
be used to store supplementary bits of actual data (in unchecked regions 
that fit in reliable and fast memory, such as the internal memory cache of 
the CPU, or static CPU registers).

For such things, the impact of transforming addressable memory widths 
through interfaces is for now not discussed in Unicode, which assumes 
that internal memory is necessarily addressed in a unit that is a power of 2 
and a multiple of 8 bits, and then interchanged or stored using this byte unit.

Today, we are witnessing a constant expansion of bus widths to allow parallel 
processing instead of multiplying the working frequency (and the energy 
spent and the heat produced, which create other environmental problems), so why 
would the 8-bit byte remain the most efficient universal unit? If you 
look at IEEE floating-point formats, they are often implemented in FPUs 
working on 80-bit units, and an 80-bit memory cell could well become 
a standard tomorrow (compatible with the increasingly used 64-bit 
architectures of today) which would no longer be a power of 2 (even if it 
stays a multiple of 8 bits).

On an 80-bit system, the easiest solution for handling UTF-32 without using 
too much space would be a unit of 40 bits (i.e. two code points per 80-bit 
memory cell). But if you consider that only 21 bits are used in Unicode,