On 01/13/2011 11:00 PM, Nick Sabalausky wrote:
"Andrei Alexandrescu"<[email protected]>  wrote in message
news:[email protected]...

This may sometimes not be what the user expected; most of the time they'd
care about the code points.


I dunno, spir has successfully convinced me that most of the time it's
graphemes the user cares about, not code points. Using code points is just
as misleading as using UTF-16 code units.

You are right that those two issues are really analogous. In practice, once universal text is truly and commonly used, I guess the problems caused by codes not representing characters may become far more obvious, and also far more serious, because (logical) errors can easily pass unseen. [In fact, how can a programmer even know, for instance, that a search routine missed its target or returned a false positive when dealing with characters from unknown languages? Indeed, there are test data sets, but they are useless if the tools one uses just ignore the issues.]

Using a 16-bit representation, and thus ignoring a fair number of code points, is maybe less problematic, because there is little chance of randomly meeting characters outside the BMP (Basic Multilingual Plane, the part of UCS whose code points are < 0x10000). Outside the BMP are the scripts of less commonly studied archaeological languages, and various sets of images such as alchemical symbols, playing cards or domino tiles. I doubt they'll ever be commonly used, except in specialised apps whose programmers know perfectly well what they are dealing with.
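
To make the "missed target / false positive" point concrete, here is a minimal sketch in Python (chosen only for illustration; the variable names and the nfc helper are my own). It assumes the same user-perceived character "é" arrives once precomposed and once as "e" plus a combining accent. Normalisation to NFC fixes this particular case, but it is not a general grapheme solution, since many combinations have no precomposed form.

```python
import unicodedata

# The same user-perceived character, encoded two ways.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE (1 code point)
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT (2 code points)

# Code-point comparison treats them as different strings.
same_raw = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))

# Searching for a plain "e" matches inside the decomposed "é": a false positive.
false_positive = "e" in decomposed

# Searching for the precomposed "é" does not find the decomposed one: a miss.
missed = precomposed not in decomposed


def nfc(s):
    """Hypothetical helper: normalise to NFC before comparing."""
    return unicodedata.normalize("NFC", s)


# After normalisation the two spellings compare equal.
same_nfc = nfc(precomposed) == nfc(decomposed)
```

Neither the false positive nor the miss raises any error, which is exactly why such bugs can pass unseen.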

A list of UCS blocks with pointers to detailed content can be found here:
http://www.fileformat.info/info/unicode/block/index.htm
Blocks beyond the BMP start at the line:
Linear B Syllabary      U+10000         U+1007F         (88)
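
As a rough illustration of where that boundary bites in a 16-bit representation, the following Python sketch (the helper name utf16_units is hypothetical) counts UTF-16 code units for one character inside the BMP and for U+10000, the first code point of the Linear B Syllabary block.

```python
def utf16_units(ch):
    """Number of 16-bit code units needed to encode ch in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


inside_bmp = "\u20ac"        # EURO SIGN, well inside the BMP
first_beyond = "\U00010000"  # first code point of the Linear B Syllabary block

# One code unit inside the BMP, two (a surrogate pair) beyond it.
units = (utf16_units(inside_bmp), utf16_units(first_beyond))

# The surrogate pair itself, as big-endian hex.
pair_hex = first_beyond.encode("utf-16-be").hex()
```

Code that treats each 16-bit unit as a character would see two "characters" for U+10000, neither of which is a valid code point on its own.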

Denis
_________________
vita es estrany
spir.wikidot.com
