*.uni

Eric Blake Sun, 17 Feb 2002 06:53:39 -0800

Brian Jones wrote:
> 
> As I recall Unicode now requires more bits than a Java 'char' allows.
> I don't know that helps at all?  I don't really know what Sun's
> solution is.  It looks like we did update to unicode data 3.0, but I
> know our implementation fails many Mauve tests related to Character.


Unicode 3.1 was the first version to assign characters beyond \uffff,
and the upcoming 3.2 adds even more.  In UTF-16, each of these
characters requires two 16-bit surrogate fields to represent it (the
first in \ud800 - \udb7f, or up to \udbff if you count the private-use
high surrogates; the second in \udc00 - \udfff).  And Java does ignore
these - the 4-byte UTF-8 sequences are illegal in class files (you have
to use a 6-byte sequence instead, 3 bytes per surrogate), and Java
identifiers may not include surrogate characters.  Sun would need to
add more methods to the API to use them, because the point of
surrogates is that two chars together have semantic meaning, while one
alone is an error.  For example, it is impossible to tell if \ud820 in
isolation is part of a letter, number, or punctuation.  So for now,
Sun's "solution" is to stall.  I did verify today that JDK 1.4 is still
on Unicode 3.0.0.
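
To make the surrogate arithmetic and the 4-byte versus 6-byte point
concrete, here is a minimal sketch.  The class and method names are
made up for illustration, and it uses library calls from later Java
releases than the JDK 1.4 discussed here, so read it as an
illustration under those assumptions rather than anything Classpath or
Sun ships.  It assumes that DataOutputStream.writeUTF is a fair
stand-in for the class file encoding, since both encode each surrogate
as its own 3-byte sequence.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Classpath or the JDK.
final class SurrogateSketch {
    // Full high-surrogate range, including the private-use part 0xDB80..0xDBFF.
    static boolean isHigh(char c) { return c >= 0xD800 && c <= 0xDBFF; }

    static boolean isLow(char c) { return c >= 0xDC00 && c <= 0xDFFF; }

    // Combine a high and a low surrogate into the code point they represent.
    static int combine(char high, char low) {
        if (!isHigh(high) || !isLow(low)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Length in bytes of the standard UTF-8 encoding of s.
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Length in bytes of the modified UTF-8 form written by writeUTF,
    // not counting its 2-byte length prefix.
    static int modifiedUtf8Length(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            return buf.size() - 2;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        char high = (char) 0xD820;
        char low = (char) 0xDC00;
        String pair = new String(new char[] { high, low });
        System.out.println(Integer.toHexString(combine(high, low))); // 18000
        System.out.println(standardUtf8Length(pair));                // 4
        System.out.println(modifiedUtf8Length(pair));                // 6
    }
}
```

Note that combine() has nothing sensible to return for a lone \ud820,
which is the reason a char-at-a-time API cannot classify it.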

The implementation of Character that I just checked in to Classpath is
identical in behavior to Sun's (fortunately, testing every method on all
64k chars is not terribly time-consuming).  However, I could not run it
through Mauve, since I still have not been able to compile a free VM on
Cygwin, and Sun's VM doesn't like me replacing core classes like
Character.  But if Character fails any tests in Mauve now, then I would
suspect that Mauve has the bugs.
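
For anyone curious what "testing every method on all 64k chars" can
look like, here is a rough sketch of the idea, not the harness I
actually used.  The class name, the firstMismatch helper, and the
ASCII-only candidate are all hypothetical, and it uses lambda syntax
from much later Java releases for brevity.

```java
import java.util.function.IntPredicate;

// Hypothetical comparison harness, not part of Classpath or Mauve.
final class ExhaustiveCharCheck {
    // Returns the first char value in 0..0xFFFF on which the two
    // predicates disagree, or -1 if they agree everywhere.
    static int firstMismatch(IntPredicate reference, IntPredicate candidate) {
        // Loop with an int: a char counter would wrap around at 0xFFFF
        // and the loop would never terminate.
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            if (reference.test(c) != candidate.test(c)) {
                return c;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Reference: whatever Character the running VM provides.
        IntPredicate reference = c -> Character.isDigit((char) c);
        // Candidate: a deliberately incomplete stand-in that only knows ASCII digits.
        IntPredicate candidate = c -> c >= '0' && c <= '9';

        int mismatch = firstMismatch(reference, candidate);
        if (mismatch < 0) {
            System.out.println("no differences");
        } else {
            System.out.println("first difference at 0x" + Integer.toHexString(mismatch));
        }
    }
}
```

One such pass per boolean method covers the whole char range, which is
why the exhaustive approach stays cheap.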

-- 
This signature intentionally left boring.

Eric Blake             [EMAIL PROTECTED]
  BYU student, free software programmer


_______________________________________________
Classpath mailing list
[EMAIL PROTECTED]
http://mail.gnu.org/mailman/listinfo/classpath

Re: generation of gnu/java/locale/*.uni
