Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

Philippe Verdy Tue, 07 Dec 2004 14:33:58 -0800

From: "Kenneth Whistler" <[EMAIL PROTECTED]>

Yes, and pigs could fly, if they had big enough wings.

Once again, this is a creative comment. As if Unicode had to be bound by architectural constraints, such as the requirement of representing code units (which are architectural for a system) only as 16-bit or 32-bit units, ignoring the fact that technologies do evolve and will not necessarily keep this constraint. 64-bit systems already exist today, and even if they can, for now, efficiently handle 16-bit and 32-bit code units so that they can be addressed individually, this will possibly not be the case in the future.

When I look at encoding forms such as UTF-16 and UTF-32, they just define the value ranges in which code units will be valid, but not necessarily their size. You are mixing this up with encoding schemes, which are what is needed for interoperability, and where other factors such as bit or byte ordering are also important in addition to the value range.

I wouldn't see anything wrong if a system were set up so that UTF-32 code units are stored in 24-bit or even 64-bit memory cells, as long as they respect and fully represent the value range defined in the encoding forms, and the system also provides an interface to convert them, through encoding schemes, to interoperable streams of 8-bit bytes.
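To make the form/scheme distinction concrete, here is a minimal sketch in Python. The function names are hypothetical, and Python integers stand in for memory cells of whatever width the system happens to use. Only the serialization step fixes a byte count and a byte order.

```python
MAX_CODE_POINT = 0x10FFFF  # top of the valid UTF-32 code unit range


def is_valid_utf32_unit(value: int) -> bool:
    """True if value is a Unicode scalar value (surrogates are excluded)."""
    return 0 <= value <= MAX_CODE_POINT and not (0xD800 <= value <= 0xDFFF)


def to_utf32be(units) -> bytes:
    """Serialize in-memory code units to the UTF-32BE encoding scheme.

    How the units were stored in memory (24-bit, 64-bit, ...) is irrelevant
    here; only the value range matters until this conversion.
    """
    out = bytearray()
    for unit in units:
        if not is_valid_utf32_unit(unit):
            raise ValueError(f"not a valid UTF-32 code unit: {unit:#x}")
        out += unit.to_bytes(4, "big")  # 4 bytes, most significant first
    return bytes(out)


# Code units held as plain integers: U+0041, U+20AC, U+1D11E.
units = [0x41, 0x20AC, 0x1D11E]
stream = to_utf32be(units)
```

The resulting `stream` is what another system would read, regardless of the cell size used internally.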

Are you saying that UTF-32 code units need to be able to represent any 32-bit value, even if the valid range is limited, for now, to the first 17 planes? An API on a 64-bit system that requires strings to be stored as UTF-32 would also define how UTF-32 code units are represented. As long as the valid range 0 to 0x10FFFF can be represented, this interface will be fine. If this system is designed so that two or three code units are stored in a single 64-bit memory cell, the valid range is not violated.
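The three-per-cell layout can be sketched as follows. This is an illustration only: the names `pack3` and `unpack3` and the low-to-high field order are assumptions, not something any API defines. Since 0x10FFFF fits in 21 bits, three code units occupy 63 of the 64 bits.

```python
UNIT_BITS = 21                      # 0x10FFFF < 2**21
UNIT_MASK = (1 << UNIT_BITS) - 1    # 0x1FFFFF


def pack3(a: int, b: int, c: int) -> int:
    """Pack three code units into one 64-bit cell (lowest field first)."""
    for unit in (a, b, c):
        if not 0 <= unit <= 0x10FFFF:
            raise ValueError(f"out of range: {unit:#x}")
    return a | (b << UNIT_BITS) | (c << (2 * UNIT_BITS))


def unpack3(cell: int):
    """Recover the three code units from a packed cell."""
    return (
        cell & UNIT_MASK,
        (cell >> UNIT_BITS) & UNIT_MASK,
        (cell >> (2 * UNIT_BITS)) & UNIT_MASK,
    )


cell = pack3(0x41, 0x20AC, 0x10FFFF)
```

Even with the largest code unit in the top field, the cell stays below 2**63, so one bit of the 64 remains unused.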

More interestingly, there already exist systems where memory is addressable in units of 1 bit, and on these systems, a UTF-32 code unit will work perfectly if code units are stored in steps of 21 bits of memory. On 64-bit systems, the possibility of individually addressing any group of bits will become an interesting option, notably when handling complex data structures such as bitfields, data compressors, bitmaps, ... No more need for costly shifts and masking. Nothing would prevent such a system from offering interoperability with 8-bit byte based systems (note also that recent memory technologies use fast serial interfaces instead of parallel buses, so that memory granularity is less important).
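For comparison, this is roughly what reading a 21-bit code unit costs on today's byte-addressed memory. It is a sketch under assumptions: the function names and the little-endian bit layout are invented for illustration. The `divmod`, shift and mask in `read_unit` are exactly the work that bit addressing would make unnecessary.

```python
UNIT_BITS = 21
UNIT_MASK = (1 << UNIT_BITS) - 1


def pack_units(units) -> bytes:
    """Store code units back to back, 21 bits each, in a byte buffer."""
    acc = 0
    for i, unit in enumerate(units):
        if not 0 <= unit <= 0x10FFFF:
            raise ValueError(f"out of range: {unit:#x}")
        acc |= unit << (i * UNIT_BITS)
    nbytes = (len(units) * UNIT_BITS + 7) // 8  # round up to whole bytes
    return acc.to_bytes(nbytes, "little")


def read_unit(buf: bytes, index: int) -> int:
    """Fetch the code unit at position index from a packed buffer."""
    byte_index, shift = divmod(index * UNIT_BITS, 8)
    # 21 bits starting at a bit offset of at most 7 span at most 4 bytes.
    window = int.from_bytes(buf[byte_index:byte_index + 4], "little")
    return (window >> shift) & UNIT_MASK


units = [0x41, 0x20AC, 0x1D11E, 0x10FFFF]
buf = pack_units(units)
```

Four code units take 84 bits here, so `buf` is 11 bytes long instead of the 16 bytes needed with 32-bit cells.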

The only cost of bit addressing is that it requires 3 more bits of address, but in a 64-bit address this cost seems very low, because the global addressable space will still be 2^61 bytes, i.e. more than 2.3*10^18 bytes, far more than any computer is likely to manage in a single process for decades to come (going by Moore's law, under which capacity doubles roughly every two years). Even such a scheme would not limit performance, given that memory caches are paged and keep growing, eliminating most of the costs and problems related to data alignment experienced today on bus-based systems.
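The address-space figure is easy to check. The short calculation below assumes a flat 64-bit address whose low 3 bits select a bit within a byte; the variable names are illustrative.

```python
ADDRESS_BITS = 64       # width of a bit address
BIT_SELECT_BITS = 3     # 2**3 = 8 bit positions per byte

addressable_bits = 2 ** ADDRESS_BITS
addressable_bytes = 2 ** (ADDRESS_BITS - BIT_SELECT_BITS)  # 2**61

print(addressable_bytes)           # 2305843009213693952
print(f"{addressable_bytes:.3e}")  # 2.306e+18
```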

Other territories are also still unexplored in microprocessors, notably the possibility of using non-binary numeric systems (think about optical or magnetic systems which could outperform the current electric systems due to reduced power and heat caused by currents of electrons through molecular substrates, replacing them by shifts of atomic states caused by light rays, and the computing possibilities offered by light diffraction through crystals). The lowest granularity of information in some future may be larger than a dual-state bit, meaning that today's 8-bit systems would need to be emulated using other numerical systems... (Note for example that to store the range 0..0x10FFFF, you would need 13 digits on a ternary system, and to store the range of 32-bit integers, you would need 21 ternary digits; memory technologies for such systems may use byte units made of 6 ternary digits, so programmers would have the choice between 3 "ternary bytes", i.e. 18 ternary digits, to store our 21-bit code units, or 4 "ternary bytes", i.e. 24 ternary digits or about 38 binary bits, to be able to store the whole 32-bit range.)
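The ternary digit counts can be verified with a few lines of integer arithmetic. The helper `digits_needed` is a hypothetical name; it returns the smallest number of digits in a given base that can distinguish a given number of values.

```python
def digits_needed(value_count: int, base: int) -> int:
    """Smallest n such that base**n >= value_count."""
    n, capacity = 0, 1
    while capacity < value_count:
        capacity *= base
        n += 1
    return n


code_point_trits = digits_needed(0x10FFFF + 1, 3)  # range 0..0x10FFFF
int32_trits = digits_needed(2 ** 32, 3)            # all 32-bit values

TRITS_PER_TERNARY_BYTE = 6                 # the 6-digit unit assumed above
three_bytes = 3 ** (3 * TRITS_PER_TERNARY_BYTE)    # capacity of 18 trits
four_bytes = 3 ** (4 * TRITS_PER_TERNARY_BYTE)     # capacity of 24 trits
```

This gives 13 and 21 ternary digits respectively. Three ternary bytes (18 digits) are enough for any 21-bit code unit, and four ternary bytes (24 digits) hold slightly more than 38 bits' worth of values, which covers the whole 32-bit range.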

Nothing here is impossible in the future (as it becomes more and more difficult to increase the density of transistors, to reduce the voltage further, to increase the working frequency, or to avoid the inevitable and random presence of natural defects in substrates, escaping from the historic binary-only systems may offer interesting opportunities for further performance increases).
