Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...
Philippe stated, and I need to correct:

> UTF-24 already exists as an encoding form (it is identical to UTF-32), if
> you just consider that encoding forms just need to be able to represent a
> valid code range within a single code unit.

This is false. Unicode encoding forms exist by virtue of their establishment as standard, by actions of the standardizing organization, the Unicode Consortium.

> UTF-32 is not meant to be restricted to 32-bit representations.

This is false. The definition of UTF-32 is: "The Unicode encoding form which assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value." It is true that UTF-32 could be (and is) implemented on computers which hold 32-bit numeric types transiently in 64-bit registers (or even other-size registers), but if an array of 64-bit integers (or 24-bit integers) were handed to some API claiming to be UTF-32, it would simply be nonconformant to the standard.

UTF-24 does not already exist as an encoding form -- it already exists as one of a large number of more or less idle speculations by character numerologists regarding other cutesy ways to handle Unicode characters on computers. Many of those cutesy ways are mere thought experiments or even simply jokes.

> However it's true that UTF-24BE and UTF-24LE could be useful as encoding
> schemes for serializations to byte-oriented streams, suppressing one
> unnecessary byte per code point.

Could be, perhaps, but is not. Implementers using UTF-32 for processing efficiency, but who have bandwidth constraints in some streaming context, should simply use one of the CESs with better size characteristics or apply compression to their data.

> Note that 64-bit systems could do the same: 3 code points per 64-bit unit
> requires only 63 bits, which can be stored in a single positive 64-bit
> integer (the remaining bit would be the sign bit, always set to 0,
> avoiding problems related to sign extension).
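The arithmetic in the quoted claim does check out: three 21-bit scalar values occupy 63 bits, leaving the sign bit clear. A minimal sketch (the `pack3`/`unpack3` helpers are made-up names, not any standard API):

```python
# Hypothetical packing: three 21-bit Unicode scalar values in one 64-bit int.
# Illustrative only -- this is not a Unicode encoding form.

def pack3(a, b, c):
    """Pack three code points (each <= 0x10FFFF, i.e. 21 bits) into 63 bits."""
    for cp in (a, b, c):
        assert 0 <= cp <= 0x10FFFF
    return (a << 42) | (b << 21) | c

def unpack3(unit):
    """Recover the three 21-bit fields from a packed 64-bit unit."""
    mask = (1 << 21) - 1          # 0x1FFFFF
    return (unit >> 42) & mask, (unit >> 21) & mask, unit & mask

unit = pack3(ord('a'), 0x10FFFF, 0x0041)
assert unit < (1 << 63)           # the sign bit indeed stays 0
assert unpack3(unit) == (0x61, 0x10FFFF, 0x41)
```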
> And even today's systems could use such a representation as well, given
> that most 32-bit processors of today also have the internal capability to
> manage 64-bit integers natively.

This is just an incredibly bad idea. Packing instructions in large-word microprocessors is one thing. You have built-in microcode which handles that, hidden away from application-level programming, and carefully architected for maximal processor efficiency. But attempting to pack character data into microprocessor words, just because you have bits available, would just detract from the efficiency of handling that data. Storage is not the issue -- you want to get the characters in and out of the registers as efficiently as possible. UTF-32 works fine for that. UTF-16 works almost as well, in aggregate, for that. And I could care less that when U+0061 goes into a 64-bit register for manipulation, the high 57 bits are all set to zero.

> Strings could be encoded as well using only 64-bit code units that would
> each store 1 to 3 code points,

Yes, and pigs could fly, if they had big enough wings.

> the unused positions being filled with invalid code points outside the
> Unicode space (for example by setting all 21 bits to 1, producing the
> out-of-range value 0x1FFFFF, used as a filler for missing code points,
> notably when the string to encode is not an exact multiple of 3 code
> points). Then, these 64-bit code units could be serialized on byte
> streams as well, multiplying the number of possibilities: UTF-64BE and
> UTF-64LE? One interest of such a scheme is that it would be more compact
> than UTF-32, because this UTF-64 encoding scheme would waste only 1 bit
> per 3 code points, instead of 1 byte and 3 bits per code point with
> UTF-32!

Wow!

> You can imagine many other encoding schemes, depending on your
> architecture choices and constraints...

Yes, one can imagine all sorts of strange things. I myself imagined UTF-17 once.
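For concreteness, the hypothetical "UTF-64BE" scheme described in the quoted text can be sketched as follows. The function names are invented; the 0x1FFFFF filler convention comes from Philippe's description, and nothing here is a real standard:

```python
# Sketch of the hypothetical "UTF-64BE": 1-3 code points per 64-bit unit,
# unused slots filled with an out-of-range value, units serialized big-endian.

FILL = 0x1FFFFF  # all 21 bits set: not a valid Unicode code point

def encode_utf64be(text):
    out = bytearray()
    cps = [ord(ch) for ch in text]
    for i in range(0, len(cps), 3):
        group = cps[i:i + 3]
        group += [FILL] * (3 - len(group))          # pad a short final group
        unit = (group[0] << 42) | (group[1] << 21) | group[2]
        out += unit.to_bytes(8, 'big')
    return bytes(out)

def decode_utf64be(data):
    cps = []
    for i in range(0, len(data), 8):
        unit = int.from_bytes(data[i:i + 8], 'big')
        for shift in (42, 21, 0):
            cp = (unit >> shift) & 0x1FFFFF
            if cp != FILL:                          # skip filler slots
                cps.append(cp)
    return ''.join(map(chr, cps))

blob = encode_utf64be("abcd")
assert len(blob) == 16                  # 4 code points -> two 64-bit units
assert decode_utf64be(blob) == "abcd"
```

Note that the round trip works, but as Ken points out, nothing about this makes it conformant Unicode; it is a thought experiment in executable form.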
But there is a difference between having fun imagining strange things and filling the list with confusing misinterpretations of the status and use of UTF-8, UTF-16, and UTF-32. --Ken
Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...
> Yes, and pigs could fly, if they had big enough wings.

An 8-foot wingspan should do it. For pictures of said flying pig see:

http://www.cincinnati.com/bigpiggig/profile_091700.html
http://www.cincinnati.com/bigpiggig/images/pig091700.jpg

Rick
Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...
From: Kenneth Whistler [EMAIL PROTECTED]

> Yes, and pigs could fly, if they had big enough wings.

Once again, this is a creative comment. As if Unicode had to be bound to architectural constraints such as the requirement of representing code units (which are architectural for a system) only as 16-bit or 32-bit units, ignoring the fact that technologies do evolve and will not necessarily keep this constraint. 64-bit systems already exist today, and even if they have, for now, the architectural capability of handling 16-bit and 32-bit code units efficiently so that they can be addressed individually, this will possibly not be the case in the future.

When I look at the encoding forms such as UTF-16 and UTF-32, they just define the value ranges in which code units will be valid, but not necessarily their size. You are mixing this with encoding schemes, which is what is needed for interoperability, and where other factors such as bit or byte ordering are also important in addition to the value range.

I won't see anything wrong if a system is set up so that UTF-32 code units are stored in 24-bit or even 64-bit memory cells, as long as they respect and fully represent the value range defined in encoding forms, and if the system also provides an interface to convert them with encoding schemes to interoperable streams of 8-bit bytes. Are you saying that UTF-32 code units need to be able to represent any 32-bit value, even if the valid range is limited, for now, to the first 17 planes?

An API on a 64-bit system that said it requires strings to be stored as UTF-32 would also define how UTF-32 code units are represented. As long as the valid range 0 to 0x10FFFF can be represented, this interface will be fine. If this system is designed so that two or three code units will be stored in a single 64-bit memory cell, no violation will occur in the valid range.
More interestingly, there already exist systems where memory is addressable in units of 1 bit, and on these systems a UTF-32 code unit will work perfectly if code units are stored at steps of 21 bits of memory. On 64-bit systems, the possibility of addressing any group of individual bits will become an interesting option, notably when handling complex data structures such as bitfields, data compressors, bitmaps, ... No more need for costly shifts and masking. Nothing would prevent such a system from offering interoperability with 8-bit-byte-based systems (note also that recent memory technologies use fast serial interfaces instead of parallel buses, so that memory granularity matters less). The only cost of bit addressing is that it requires 3 extra bits of address, but in a 64-bit address this cost seems very low, because the global addressable space would still be more than 2.3*10^18 bytes -- much more than any computer will manage in a single process for the next century (according to Moore's law, which doubles computing capability roughly every two years). Even such a scheme would not limit performance, given that memory caches are paged, and these caches keep growing, eliminating most of the costs and problems related to data alignment experienced today on bus-based systems.

Other territories also remain unexplored in microprocessors, notably the possibility of using non-binary numeric systems (think about optical or magnetic systems which could outperform current electric systems due to the reduced power and heat caused by currents of electrons through molecular substrates, replacing them with shifts of atomic states caused by light rays, and the computing possibilities offered by light diffraction through crystals). The lowest granularity of information in some future may be larger than a dual-state bit, meaning that today's 8-bit systems would need to be emulated using other numerical systems...
(Note for example that to store the range 0..0x10FFFF you would need 13 digits in a ternary system, and to store the range of 32-bit integers you would need 21 ternary digits; memory technologies for such systems might use byte units made of 6 ternary digits, so programmers would have the choice between 3 ternary bytes, i.e. 18 ternary digits, to store our 21-bit code units, or 4 ternary bytes, i.e. 24 ternary digits -- about 38 binary bits of capacity -- to be able to store the whole 32-bit range.)

Nothing there is impossible for the future (when it becomes more and more difficult to increase the density of transistors, or to reduce the voltage further, or to increase the working frequency, or to avoid the inevitable and random presence of natural defects in substrates; escaping from historic binary-only systems may offer interesting opportunities for further performance increases).
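The ternary digit counts in the parenthetical can be checked mechanically (a quick sanity check, not part of any standard):

```python
# Verify the base-3 digit counts claimed above: 13 trits cover the Unicode
# range 0..0x10FFFF, and 21 trits cover the full 32-bit range.
import math

def ternary_digits(max_value):
    """Smallest number of base-3 digits that can represent 0..max_value."""
    return math.ceil(math.log(max_value + 1, 3))

assert ternary_digits(0x10FFFF) == 13      # 3**13 = 1594323 > 1114112
assert ternary_digits(2**32 - 1) == 21     # 3**21 ~ 1.05e10 > 2**32
```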
Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...
Philippe continued:

> As if Unicode had to be bound to architectural constraints such as the
> requirement of representing code units (which are architectural for a
> system) only as 16-bit or 32-bit units,

Yes, it does. By definition. In the standard.

> ignoring the fact that technologies do evolve and will not necessarily
> keep this constraint. 64-bit systems already exist today, and even if
> they have, for now, the architectural capability of handling 16-bit and
> 32-bit code units efficiently so that they can be addressed individually,
> this will possibly not be the case in the future.

This is just as irrelevant as worrying about the fact that 8-bit character encodings may not be handled efficiently by some 32-bit processors.

> When I look at the encoding forms such as UTF-16 and UTF-32, they just
> define the value ranges in which code units will be valid, but not
> necessarily their size.

Philippe, you are wrong. Go reread the standard. Each of the encoding forms is *explicitly* defined in terms of code unit size in bits: "The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form." If there is something ambiguous or unclear in wording such as that, I think the UTC would like to know about it.

> You are mixing this with encoding schemes, which is what is needed for
> interoperability, and where other factors such as bit or byte ordering
> are also important in addition to the value range.

I am not mixing it up -- you are, unfortunately. And it is most unhelpful on this list to have people waxing on, with apparently authoritative statements about the architecture of the Unicode Standard, which on examination turn out to be flat wrong.

> I won't see anything wrong if a system is set up so that UTF-32 code
> units are stored in 24-bit or even 64-bit memory cells, as long as they
> respect and fully represent the value range defined in encoding forms,

Correct. And I said as much.
There is nothing wrong with implementing UTF-32 on a 64-bit processor. Putting a UTF-32 code point into a 64-bit register is fine. What you have to watch out for is handing me a 64-bit array of ints and claiming that it is a UTF-32 sequence of code points -- it isn't.

> and if the system also provides an interface to convert them with
> encoding schemes to interoperable streams of 8-bit bytes.

No, you have to have an interface which hands me the correct data type when I declare it uint_32, and which gives me correct offsets in memory if I walk an index pointer down an array. That applies to the encoding *form*, and is completely separate from provision of any streaming interface that wants to feed data back and forth in terms of byte streams.

> Are you saying that UTF-32 code units need to be able to represent any
> 32-bit value, even if the valid range is limited, for now, to the first
> 17 planes?

Yes.

> An API on a 64-bit system that said it requires strings to be stored as
> UTF-32 would also define how UTF-32 code units are represented. As long
> as the valid range 0 to 0x10FFFF can be represented, this interface will
> be fine.

No, it will not. Read the standard. An API on a 64-bit system that uses an unsigned 32-bit datatype for UTF-32 is fine. It isn't fine if it uses an unsigned 64-bit datatype for UTF-32.

> If this system is designed so that two or three code units will be stored
> in a single 64-bit memory cell, no violation will occur in the valid
> range.

You can do whatever the heck crazy thing you want to do internal to your data manipulation, but you cannot surface a datatype packed that way and conformantly claim that it is UTF-32.

> More interestingly, there already exist systems where memory is
> addressable in units of 1 bit, and on these systems, ...
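Ken's conformance point -- one scalar value per unsigned 32-bit code unit, with correct offsets when walking down the array -- can be illustrated with Python's `array` module (an illustrative sketch; it assumes the `'I'` typecode is 4 bytes, which holds on mainstream platforms):

```python
# A conformant UTF-32 sequence: one Unicode scalar value per unsigned
# 32-bit code unit, contiguous in memory -- not values packed into wider words.
import sys
from array import array

text = "A\u00E9\U0001F600"
utf32 = array('I', (ord(ch) for ch in text))   # one scalar per code unit
assert utf32.itemsize == 4                     # code units really are 32-bit
assert list(utf32) == [0x41, 0xE9, 0x1F600]

# Walking an index down the array gives correct 4-byte offsets, and the raw
# memory matches the UTF-32 encoding scheme for the machine's byte order:
codec = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'
assert utf32.tobytes() == text.encode(codec)
```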
[excised some vamping on the future of computers]

> Nothing there is impossible for the future (when it becomes more and more
> difficult to increase the density of transistors, or to reduce the
> voltage further, or to increase the working frequency, or to avoid the
> inevitable and random presence of natural defects in substrates; escaping
> from historic binary-only systems may offer interesting opportunities for
> further performance increases).

Look, I don't care if the processors are dealing in qubits on molecular arrays under the covers. It is the job of the hardware folks to surface appropriate machine instructions that compiler makers can use to surface appropriate formal language constructs to programmers, to enable hooking the defined datatypes of the character encoding standards into programming language datatypes. It is the job of the Unicode Consortium to define the encoding forms for representing Unicode code points, so that people manipulating Unicode digital text representation can do so reliably using general-purpose programming languages with well-defined textual data constructs.
Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...
- Original Message - From: Arcane Jill [EMAIL PROTECTED]

> Probably a dumb question, but how come nobody's invented UTF-24 yet? I
> just made that up, it's not an official standard, but one could easily
> define UTF-24 as UTF-32 with the most-significant byte (which is always
> zero) removed, hence all characters are stored in exactly three bytes and
> all are treated equally. You could have UTF-24LE and UTF-24BE variants,
> and even UTF-24 BOMs. Of course, I'm not suggesting this is a
> particularly brilliant idea, but I just wonder why no-one's suggested it
> before.

UTF-24 already exists as an encoding form (it is identical to UTF-32), if you just consider that encoding forms just need to be able to represent a valid code range within a single code unit. UTF-32 is not meant to be restricted to 32-bit representations. However, it's true that UTF-24BE and UTF-24LE could be useful as encoding schemes for serializations to byte-oriented streams, suppressing one unnecessary byte per code point.

> (And then of course, there's UTF-21, in which blocks of 21 bits are
> concatenated, so that eight Unicode characters will be stored in every 21
> bytes - and not to mention UTF-20.087462841250343, in which a plain text
> document is simply regarded as one very large integer expressed in radix
> 1114112, and whose UTF-20.087462841250343 representation is simply that
> number expressed in binary. But now I'm getting /very/ silly - please
> don't take any of this seriously.) :-)

I don't think that UTF-21 would be useful as an encoding form, but possibly as an encoding scheme where 3 always-zero bits would be stripped, providing a tiny compression level, which would only be justified for transmission over serial or network links. However, I do think that such an optimization would have the effect of removing byte alignments, on which more powerful compressors work. If you really need more effective compression, use SCSU, or apply deflate or bzip2 compression to UTF-8, UTF-16, or UTF-24/32...
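Arcane Jill's hypothetical UTF-24BE is easy to sketch: each code point becomes exactly three big-endian bytes, i.e. UTF-32BE with its always-zero top byte dropped (illustrative code only; UTF-24 is not a standard encoding):

```python
# Hypothetical UTF-24BE: three big-endian bytes per code point.

def encode_utf24be(text):
    return b''.join(ord(ch).to_bytes(3, 'big') for ch in text)

def decode_utf24be(data):
    return ''.join(chr(int.from_bytes(data[i:i + 3], 'big'))
                   for i in range(0, len(data), 3))

sample = "A\u00E9\U0001F600"
blob = encode_utf24be(sample)
assert len(blob) == 9                       # 3 bytes per code point
assert decode_utf24be(blob) == sample
# Exactly 25% smaller than the corresponding (BOM-less) UTF-32 stream:
assert len(blob) * 4 == len(sample.encode('utf-32-be')) * 3
```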
(There's not much difference between compressing UTF-24 or UTF-32 with generic compression algorithms like deflate or bzip2.)

> The UTF-24 thing seems a reasonably sensible question though. Is it just
> that we don't like it because some processors have alignment restrictions
> or something?

There do exist, even still today, 4-bit processors and 1-bit processors, where the smallest addressable memory unit is smaller than 8 bits. They are used for low-cost micro-devices, notably to build automated robots for industry, or even for many home/kitchen devices. I don't know whether they need Unicode to represent international text, given that they often have a very limited user interface, incapable of inputting or outputting text, but who knows? Maybe they are used in some mobile phones, or within smart keyboards or tablets or other input devices connected to PCs...

There also exist systems where the smallest addressable memory cell is a 9-bit byte. This is more of an issue here, because the Unicode standard does not specify whether encoding schemes (which serialize code points to bytes) should set the 9th bit of each byte to 0, or should fill every bit of memory, even if this means that the 8-bit bytes of UTF-8 will not stay synchronized with the memory's 9-bit bytes. Somebody already introduced UTF-9 in the past for 9-bit systems.

A 36-bit processor could as well address memory in cells of 36 bits, where the 4 highest bits would either be used as CRC control bits (generated and checked automatically by the processor or a memory bus interface within memory regions where this behavior is allowed), or be used to store supplementary bits of actual data (in unchecked regions that fit in reliable and fast memory, such as the internal memory cache of the CPU, or static CPU registers).
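The two serialization options mentioned for 9-bit-byte machines can be sketched on plain integers: (a) one octet per 9-bit word with the top bit zero, versus (b) dense packing, where octet boundaries drift out of sync with word boundaries (hypothetical helper names, not any standard):

```python
# Two ways to put a stream of 8-bit bytes into 9-bit memory words.

def octets_to_9bit_padded(data):
    """Option (a): one octet per 9-bit word; the 9th (high) bit stays 0."""
    return [b for b in data]                # each word holds a value 0..255

def octets_to_9bit_dense(data):
    """Option (b): pack octets bit-for-bit; words and octets fall out of sync."""
    bits = ''.join(f'{b:08b}' for b in data)
    bits += '0' * (-len(bits) % 9)          # zero-pad the final word
    return [int(bits[i:i + 9], 2) for i in range(0, len(bits), 9)]

data = "\u00E9".encode('utf-8')             # two UTF-8 octets: 0xC3 0xA9
assert octets_to_9bit_padded(data) == [0xC3, 0xA9]
# 16 bits of payload need two 9-bit words; octet boundaries no longer align:
assert len(octets_to_9bit_dense(data)) == 2
```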
For such things, the impact of transforming addressable memory widths through interfaces is for now not discussed in Unicode, which supposes that internal memory is necessarily addressed in units that are a power of 2 and a multiple of 8 bits, and then interchanged or stored using this byte unit. Today we are witnessing a constant expansion of bus widths to allow parallel processing instead of multiplying the working frequency (and the energy spent and heat produced, which generate other environmental problems), so why would the 8-bit byte remain the most efficient universal unit? If you look at IEEE floating-point formats, they are often implemented in FPUs working on 80-bit units, and an 80-bit memory cell could as well become tomorrow a standard (compatible with the increasingly common 64-bit architectures of today) which would no longer be a power of 2 (even if it stays a multiple of 8 bits). On an 80-bit system, the easiest solution for handling UTF-32 without using too much space would be a unit of 40 bits (i.e. two code points per 80-bit memory cell). But if you consider that only 21 bits are used in Unicode,