Thanks for the good link! It also says:
===================================================
Q: What’s the algorithm to convert from UTF-16 to character codes?

A: The Unicode Standard used to contain a short algorithm, now there
is just a bit distribution table. Here are three short code snippets
that translate the information from the bit distribution table into C
code that will convert to and from UTF-16.

Using the following type definitions

#include <stdint.h>

typedef uint16_t UTF16;
typedef uint32_t UTF32;

the first snippet calculates the high (or leading) surrogate from a
character code C.

const UTF16 HI_SURROGATE_START = 0xD800;

UTF16 X = (UTF16) C;                   // low 16 bits of C
UTF32 U = (C >> 16) & ((1 << 5) - 1);  // plane number, 5 bits
UTF16 W = (UTF16) U - 1;               // 4-bit field stored in the surrogate
UTF16 HiSurrogate = HI_SURROGATE_START | (W << 6) | (X >> 10);

where X, U and W correspond to the labels used in Table 3-5 UTF-16 Bit
Distribution. The next snippet does the same for the low surrogate.

const UTF16 LO_SURROGATE_START = 0xDC00;

UTF16 X = (UTF16) C;                   // low 16 bits of C
UTF16 LoSurrogate = (UTF16) (LO_SURROGATE_START | (X & ((1 << 10) - 1)));

Finally, the reverse, where hi and lo are the high and low surrogate,
and C the resulting character:

UTF32 X = ((hi & ((1 << 6) - 1)) << 10) | (lo & ((1 << 10) - 1));
UTF32 W = (hi >> 6) & ((1 << 5) - 1);
UTF32 U = W + 1;
UTF32 C = (U << 16) | X;

A caller would need to ensure that C, hi, and lo are in the
appropriate ranges. [AF]

Q: Isn’t there a simpler way to do this?

A: There is a much simpler computation that does not try to follow the
bit distribution table.

// constants
const UTF32 LEAD_OFFSET = 0xD800 - (0x10000 >> 10);
const UTF32 SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;

// computations: code point to surrogate pair
UTF16 lead = LEAD_OFFSET + (codepoint >> 10);
UTF16 trail = 0xDC00 + (codepoint & 0x3FF);

// and back: surrogate pair to code point
UTF32 codepoint = (lead << 10) + trail + SURROGATE_OFFSET;
[MD]
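
To try the simpler computation, it can be wrapped in two small functions.
This is only a sketch: the names encode_pair and decode_pair are mine, the
typedefs assume <stdint.h>, and the decode relies on unsigned 32-bit
wraparound, so SURROGATE_OFFSET is written with unsigned literals here.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t UTF16;
typedef uint32_t UTF32;

static const UTF32 LEAD_OFFSET = 0xD800 - (0x10000 >> 10);
// Deliberately wraps around modulo 2^32; the sum in decode_pair wraps back.
static const UTF32 SURROGATE_OFFSET = 0x10000u - (0xD800u << 10) - 0xDC00u;

// Hypothetical wrapper: cp must be a supplementary code point.
static void encode_pair(UTF32 cp, UTF16 *lead, UTF16 *trail)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    *lead  = (UTF16) (LEAD_OFFSET + (cp >> 10));
    *trail = (UTF16) (0xDC00 + (cp & 0x3FF));
}

// Hypothetical wrapper: lead and trail must form a valid surrogate pair.
static UTF32 decode_pair(UTF16 lead, UTF16 trail)
{
    return ((UTF32) lead << 10) + trail + SURROGATE_OFFSET;
}
```

With these, encode_pair(0x1F600, ...) should give lead 0xD83D and trail
0xDE00, and decode_pair(0xD83D, 0xDE00) should give 0x1F600 again.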

Q: Why are some people opposed to UTF-16?

A: People familiar with variable width East Asian character sets such
as Shift-JIS (SJIS) are understandably nervous about UTF-16, which
sometimes requires two code units to represent a single character.
They are well acquainted with the problems that variable-width codes
have caused. However, there are some important differences between the
mechanisms used in SJIS and UTF-16:

Overlap:

In SJIS, there is overlap between the leading and trailing code unit
values, and between the trailing and single code unit values. This
causes a number of problems:

  - It causes false matches. For example, searching for an “a” may match
    against the trailing code unit of a Japanese character.
  - It prevents efficient random access. To know whether you are on a
    character boundary, you have to search backwards to find a known
    boundary.
  - It makes the text extremely fragile. If a unit is dropped from a
    leading-trailing code unit pair, many following characters can be
    corrupted.

In UTF-16, the code point ranges for high and low surrogates, as well
as for single units, are all completely disjoint. None of these
problems occur:

  - There are no false matches.
  - The location of the character boundary can be directly determined from
    each code unit value.
  - A dropped surrogate will corrupt only a single character.
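
Because the three ranges are disjoint, the boundary test needs nothing but
the code unit itself. A minimal sketch, with helper names of my own
choosing (they are not from the FAQ), and with an unpaired surrogate
counted as one character:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UTF16;

// High surrogates are D800..DBFF, low surrogates are DC00..DFFF.
static int is_high_surrogate(UTF16 u) { return (u & 0xFC00) == 0xD800; }
static int is_low_surrogate(UTF16 u)  { return (u & 0xFC00) == 0xDC00; }

// Return the index where the character containing s[i] starts.
// At most one step backwards is ever needed.
static size_t char_start(const UTF16 *s, size_t len, size_t i)
{
    assert(s != NULL && i < len);
    if (i > 0 && is_low_surrogate(s[i]) && is_high_surrogate(s[i - 1]))
        return i - 1;
    return i;
}

// Count characters; a well-formed pair counts once, and so does a lone
// surrogate, so a dropped unit affects only one character.
static size_t count_code_points(const UTF16 *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (is_high_surrogate(s[i]) && i + 1 < len && is_low_surrogate(s[i + 1]))
            i++;  // skip the trailing half of the pair
        n++;
    }
    return n;
}
```

For the units 0061 D83D DE00 0062, char_start at index 2 should return 1,
and the count should be 3. If D83D is dropped, the remaining 0061 DE00 0062
still counts as 3, with the neighbouring "a" and "b" untouched.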

Frequency:

  - The vast majority of SJIS characters require 2 units, but characters
    using single units occur commonly and often have special importance,
    for example in file names.
  - With UTF-16, relatively few characters require 2 units. The vast
    majority of characters in common use are single code units. Even in
    East Asian text, the incidence of surrogate pairs should be well less
    than 1% of all text storage on average. (Certain documents, of course,
    may have a higher incidence of surrogate pairs, just as phthisique is
    a fairly infrequent word in English, but may occur quite often in a
    particular scholarly text.) [AF]

_______________________________________________
mseide-msegui-talk mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mseide-msegui-talk