Thanks for the good link! It also says:
===================================================
Q: What’s the algorithm to convert from UTF-16 to character codes?
A: The Unicode Standard used to contain a short algorithm, now there is just a bit distribution table. Here are three short code snippets that translate the information from the bit distribution table into C code that will convert to and from UTF-16.

Using the following type definitions

    #include <stdint.h>
    typedef uint16_t UTF16;
    typedef uint32_t UTF32;

the first snippet calculates the high (or leading) surrogate from a character code C.

    const UTF16 HI_SURROGATE_START = 0xD800;

    UTF16 X = (UTF16) C;                      /* low 16 bits of C */
    UTF32 U = (C >> 16) & ((1 << 5) - 1);     /* plane number, 5 bits */
    UTF16 W = (UTF16) U - 1;
    UTF16 HiSurrogate = HI_SURROGATE_START | (W << 6) | X >> 10;

where X, U and W correspond to the labels used in Table 3-5 UTF-16 Bit Distribution. The next snippet does the same for the low surrogate.

    const UTF16 LO_SURROGATE_START = 0xDC00;

    UTF16 X = (UTF16) C;
    UTF16 LoSurrogate = (UTF16) (LO_SURROGATE_START | X & ((1 << 10) - 1));

Finally, the reverse, where hi and lo are the high and low surrogate, and C the resulting character

    UTF32 X = (hi & ((1 << 6) - 1)) << 10 | lo & ((1 << 10) - 1);
    UTF32 W = (hi >> 6) & ((1 << 5) - 1);
    UTF32 U = W + 1;
    UTF32 C = U << 16 | X;

A caller would need to ensure that C, hi, and lo are in the appropriate ranges. [AF]

Q: Isn’t there a simpler way to do this?
A: There is a much simpler computation that does not try to follow the bit distribution table.

    // constants
    const UTF32 LEAD_OFFSET = 0xD800 - (0x10000 >> 10);
    const UTF32 SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;

    // computations: code point to surrogate pair
    UTF16 lead = LEAD_OFFSET + (codepoint >> 10);
    UTF16 trail = 0xDC00 + (codepoint & 0x3FF);

    // and back: surrogate pair to code point
    UTF32 codepoint = (lead << 10) + trail + SURROGATE_OFFSET;

[MD]

Q: Why are some people opposed to UTF-16?
A: People familiar with variable-width East Asian character sets such as Shift-JIS (SJIS) are understandably nervous about UTF-16, which sometimes requires two code units to represent a single character. They are well acquainted with the problems that variable-width codes have caused.
However, there are some important differences between the mechanisms used in SJIS and UTF-16:

Overlap:

In SJIS, there is overlap between the leading and trailing code unit values, and between the trailing and single code unit values. This causes a number of problems:

  - It causes false matches. For example, searching for an “a” may match against the trailing code unit of a Japanese character.
  - It prevents efficient random access. To know whether you are on a character boundary, you have to search backwards to find a known boundary.
  - It makes the text extremely fragile. If a unit is dropped from a leading-trailing code unit pair, many following characters can be corrupted.

In UTF-16, the code point ranges for high and low surrogates, as well as for single units, are all completely disjoint. None of these problems occur:

  - There are no false matches.
  - The location of the character boundary can be directly determined from each code unit value.
  - A dropped surrogate will corrupt only a single character.

Frequency:

The vast majority of SJIS characters require 2 units, but characters using single units occur commonly and often have special importance, for example in file names.

With UTF-16, relatively few characters require 2 units. The vast majority of characters in common use are single code units. Even in East Asian text, the incidence of surrogate pairs should be well less than 1% of all text storage on average. (Certain documents, of course, may have a higher incidence of surrogate pairs, just as phthisique is a fairly infrequent word in English, but may occur quite often in a particular scholarly text.) [AF]
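To make the point about disjoint ranges concrete, here is a minimal sketch in C. It is my own illustration and not part of the FAQ: the names classify_unit, char_start and count_chars are hypothetical, and counting an unpaired surrogate as one damaged character is an assumption chosen for the example. It shows that the value of a code unit alone tells you its role, so a character boundary is never more than one step back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Role of a single UTF-16 code unit. The three ranges are disjoint,
   so the unit's value alone decides its role. */
typedef enum { UNIT_SINGLE, UNIT_LEAD, UNIT_TRAIL } UnitKind;

static UnitKind classify_unit(uint16_t u)
{
    if (u >= 0xD800 && u <= 0xDBFF) return UNIT_LEAD;   /* high surrogate */
    if (u >= 0xDC00 && u <= 0xDFFF) return UNIT_TRAIL;  /* low surrogate */
    return UNIT_SINGLE;
}

/* Index at which the character containing text[i] starts.
   At most one step back is ever needed. */
static size_t char_start(const uint16_t *text, size_t len, size_t i)
{
    assert(i < len);
    if (i > 0 && classify_unit(text[i]) == UNIT_TRAIL
              && classify_unit(text[i - 1]) == UNIT_LEAD)
        return i - 1;
    return i;
}

/* Count characters. An unpaired surrogate is counted as one (damaged)
   character; the units around it are unaffected. */
static size_t count_chars(const uint16_t *text, size_t len)
{
    size_t n = 0, i = 0;
    while (i < len) {
        if (classify_unit(text[i]) == UNIT_LEAD && i + 1 < len
            && classify_unit(text[i + 1]) == UNIT_TRAIL)
            i += 2;   /* well-formed surrogate pair */
        else
            i += 1;   /* single unit or unpaired surrogate */
        n++;
    }
    return n;
}
```

For example, with the units 0x0061 0xD834 0xDD1E 0x0062 ("a", U+1D11E, "b"), char_start at index 2 steps back to index 1. If the lead 0xD834 is dropped, only that one character is damaged and "a" and "b" are still found where expected.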
_______________________________________________
mseide-msegui-talk mailing list
https://lists.sourceforge.net/lists/listinfo/mseide-msegui-talk

