Fw: Nicest UTF

Philippe Verdy Sun, 05 Dec 2004 14:58:39 -0800

From: "Doug Ewell" <[EMAIL PROTECTED]>

Here is a string, expressed as a sequence of bytes in SCSU:

05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E

See how long it takes you to decode this to Unicode code points.  (Do
not refer to UTN #14; that would be cheating. :-)

Without looking at it, it's easy to see that this stream is divided into three sections, introduced by 05 1C, then 05 1D, then 12. Without looking at the UTN I can't remember what these do (i.e. which range of Unicode code points they select), but the other bytes are simple offsets relative to the start of the selected range. The third section also ends with a regular full stop (2E) in the ASCII range selected for the low half-page, and the other bytes are offsets into the script block selected by 12.

Immediately I can identify this string, without looking at any table:

"Moscow" is ??????.

where " is some opening or closing quotation mark and where each ?
replaces a character that I can't decipher from my defective memory
alone. (I don't need to remember the details of the standard table of
ranges, because I know that this table is given in full in a small and
easily available document.)

A computer can do this much better than I can (it can also "know" much
better than I can what corresponds to a given code point like U+6327, if it
is actually assigned; I would have to look into a specification or use a
charmap tool, if I'm not used to entering this character in my texts).

The decoder part of SCSU remains trivial to implement, given the small but complete list of codes that can alter the state of the decoder: there is no choice in their interpretation, and both the set of variables needed to store the decoder state and the number of decision tests at each step are very limited. This is a basic finite-state automaton.
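
To illustrate how little state is involved, here is a minimal sketch of a decoder for SCSU's single-byte mode only. The tag values and window offsets are written from recollection of UTS #6 and should be checked against it before being relied on; the function names are invented for this example, and Unicode mode, SQU, SDX and the special window indices are deliberately left out.

```python
# Window offsets as recalled from UTS #6 (assumption: verify against the spec).
STATIC_WINDOWS = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DEFAULT_DYNAMIC_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]


def window_offset(x: int) -> int:
    """Map the argument byte of an SDn tag to a window offset (common ranges only)."""
    if 0x01 <= x <= 0x67:
        return x * 0x80
    if 0x68 <= x <= 0xA7:
        return x * 0x80 + 0xAC00
    raise NotImplementedError(f"window index 0x{x:02X} is not handled in this sketch")


def scsu_decode_single_byte(data: bytes) -> str:
    """Decode SCSU data that uses only single-byte-mode tags SQn, SCn and SDn."""
    dynamic = list(DEFAULT_DYNAMIC_WINDOWS)  # the eight redefinable windows
    active = 0                               # index of the active dynamic window
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b >= 0x80:
            # High half: offset into the active dynamic window.
            out.append(chr(dynamic[active] + (b - 0x80)))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            # Low half: ASCII passes through unchanged.
            out.append(chr(b))
        elif 0x01 <= b <= 0x08:
            # SQn: quote one character from window n without changing state.
            n = b - 0x01
            q = data[i]
            i += 1
            if q < 0x80:
                out.append(chr(STATIC_WINDOWS[n] + q))
            else:
                out.append(chr(dynamic[n] + (q - 0x80)))
        elif 0x10 <= b <= 0x17:
            # SCn: make dynamic window n the active one.
            active = b - 0x10
        elif 0x18 <= b <= 0x1F:
            # SDn: redefine dynamic window n and make it active.
            active = b - 0x18
            dynamic[active] = window_offset(data[i])
            i += 1
        else:
            raise NotImplementedError(f"tag 0x{b:02X} is not handled in this sketch")
    return "".join(out)


sample = bytes.fromhex("05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E")
print(scsu_decode_single_byte(sample))
```

Under these assumptions, 05 is SQ4 (quote one character from static window 4, which starts at U+2000) and 12 is SC2 (switch to dynamic window 2, which starts at U+0400 by default), so the sample decodes to U+201C, "Moscow", U+201D, " is ", the Cyrillic letters U+041C U+043E U+0441 U+043A U+0432 U+0430, and a full stop. The whole decoder state is the `dynamic` list and the `active` index.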

Only the encoder may be a bit complex to write (if one wants to generate the smallest possible result), but even a moderately skilled programmer could find a simple, working scheme that still gives an excellent compression rate (around 1 to 1.2 bytes per character on average for any Latin text, and around 1.2 to 1.5 bytes per character for Asian texts, which would still make SCSU a good choice compared with UTF-32 or even UTF-8).
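
As a rough check of such figures on one very short sample, the sketch below compares the 22-byte SCSU string quoted at the top of this message with the standard UTFs. It assumes the string decodes to U+201C, "Moscow", U+201D, " is ", six Cyrillic letters and a full stop (19 characters); the variable names are made up for this example, and a single short string says nothing about averages over real text.

```python
# Assumed decoding of the quoted SCSU bytes (19 code points).
text = "\u201cMoscow\u201d is \u041c\u043e\u0441\u043a\u0432\u0430."
scsu = bytes.fromhex("05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E")

# Encoded sizes in bytes; the "-le" codecs are used so that no BOM is counted.
sizes = {
    "SCSU": len(scsu),
    "UTF-8": len(text.encode("utf-8")),
    "UTF-16": len(text.encode("utf-16-le")),
    "UTF-32": len(text.encode("utf-32-le")),
}

for name, n in sizes.items():
    print(f"{name}: {n} bytes, {n / len(text):.2f} bytes per character")
```

For this sample the sizes are 22 bytes for SCSU, 29 for UTF-8, 38 for UTF-16 and 76 for UTF-32, i.e. about 1.16 bytes per character for SCSU.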
