From: "Simon Cozens" <[EMAIL PROTECTED]>
 
> UTF-8 uses surrogates too, which is screamingly difficult to process.

But Unicode already has combining characters: we already must (or should)
check that a string does not start with a combining character before cutting
or pasting it, and so on. And the notion of what a character is is rather
arbitrary in, say, Indic scripts anyway: is it a graphical unit or a
syllabic unit, etc.
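As a sketch of the kind of check meant above: in Java one can ask whether the character at a proposed cut point is a combining mark, using the standard Character.getType classification. (SafeCut and unsafeCut are illustrative names, not from any library.)

```java
// Sketch: detect when cutting a string at an index would strand a
// combining mark, separating it from its base character.
public class SafeCut {
    // True if c is a Unicode combining mark (categories Mn, Mc, Me).
    static boolean isCombining(char c) {
        int t = Character.getType(c);
        return t == Character.NON_SPACING_MARK
            || t == Character.COMBINING_SPACING_MARK
            || t == Character.ENCLOSING_MARK;
    }

    // True if splitting s at index i would leave the right-hand piece
    // starting with a combining character.
    static boolean unsafeCut(String s, int i) {
        return i > 0 && i < s.length() && isCombining(s.charAt(i));
    }

    public static void main(String[] args) {
        String s = "e\u0301tude"; // "étude" spelled with a combining acute accent
        System.out.println(unsafeCut(s, 1)); // cutting between 'e' and U+0301: unsafe
        System.out.println(unsafeCut(s, 2)); // cutting after the accent: fine
    }
}
```

The same idea extends to grapheme-cluster boundaries generally, which is where the "what is a character?" question for Indic scripts bites.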

> > other than Java internally with their recent APIs. So I'd like to know
> > more about whether Japanese and Chinese people are really using
> > something other than Unicode or whether they are just using variant
> > encodings for data that their software treats internally as Unicode.
> 
> I have a very strong suspicion it depends on the nationality of the
> programmer. :) (And we're supposed to be generating programming languages
> for programmers...)

Database systems tend to store data as either 8-bit or 16-bit units.  I have
seen an example of people using Java characters but keeping Big5 code
points in them.  The Java character is 16 bits, so a Big5 code point fits,
but of course they cannot use the built-in string-handling libraries.  This
may be efficient because it removes the need for transcoding, but it will be
confusing unless the programmer documents that the String they are using is
not in fact a Unicode string but a Big5 string widened to 16-bit fixed-width
units and plonked in.  It is a convenient hack, but it will make life very
difficult for those programmers if their system ever needs to support real
Unicode characters.
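To make the hazard concrete, here is a minimal sketch of the hack described above, contrasted with proper transcoding. It assumes a JVM whose installed charsets include "Big5" (common, but not guaranteed on every runtime); the byte pair 0xA4 0xA4 is the Big5 encoding of 中 (U+4E2D).

```java
import java.nio.charset.Charset;

public class Big5Hack {
    public static void main(String[] args) {
        // Big5 encoding of 中 (U+4E2D) is the byte pair 0xA4 0xA4.
        byte[] big5 = { (byte) 0xA4, (byte) 0xA4 };

        // The hack: pack the raw Big5 code point into a 16-bit char.
        // The char "fits", but as Unicode it now denotes U+A4A4,
        // a completely unrelated character in the Yi block.
        char hacked = (char) 0xA4A4;

        // Proper transcoding through the Big5 charset yields U+4E2D.
        String proper = new String(big5, Charset.forName("Big5"));

        System.out.printf("hacked: U+%04X%n", (int) hacked);
        System.out.printf("proper: U+%04X%n", (int) proper.charAt(0));
    }
}
```

Every built-in String operation (comparison, case mapping, normalization, rendering) will interpret the hacked char as U+A4A4, which is why the approach collapses the moment real Unicode support is needed.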

Cheers
Rick Jelliffe
