Re: [d'oh!] java APIs are not powerful enough to handle the XML spec!!

Stefano Mazzocchi Fri, 14 Nov 2003 08:01:19 -0800

On 13 Nov 2003, at 21:08, J.Pietschmann wrote:

Stefano Mazzocchi wrote:
The day somebody asks you why java needs to be replaced, one answer will be 'it only supports 16-bit chars'. laughable as it might seem, it's true.
yes, people, a Unicode char is not 16 bits (as I always thought!) but 32!!

This is a misconception.

yeah, well, I'm not talking about the encoding, but about the fact that you can't fit all unicode chars in a 16-bit address space, that was my point.

UTF-32 uses a flat 32-bit encoding. UTF-16 and UTF-8 use a different, variable-length type of encoding (which is the same kind I used in the SAX compiler that cocoon uses).

Unicode is an odd mixture: at the same time it defines codepoints for
representing characters and "surrogate characters" for encoding
non-baseplane characters (whose codepoints don't fit into 16 bit).
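The surrogate mechanism described above is plain arithmetic. A minimal sketch, assuming nothing beyond core Java; the class name SurrogateMath and its two helpers are made up for illustration, and U+1D49C (MATHEMATICAL SCRIPT CAPITAL A) is just a convenient non-baseplane example:

```java
class SurrogateMath {
    // Splits a non-baseplane code point (U+10000..U+10FFFF) into two
    // 16-bit surrogate code units, as UTF-16 does.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a non-baseplane code point");
        }
        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        return new char[] { high, low };
    }

    // Reverses toSurrogates(): recombines a high/low pair into one code point.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1D49C);
        System.out.printf("U+1D49C -> %04X %04X%n", (int) pair[0], (int) pair[1]);
        System.out.printf("and back -> U+%X%n", fromSurrogates(pair[0], pair[1]));
    }
}
```

Two 10-bit halves give 2^20 extra code points on top of the base plane, which is where the 0x110000 limit comes from.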

ISO 10646 originally intended to use a 32-bit encoding for up to 2^31
characters. Because of slow progress and complaints about "wasting space",
the Unicode consortium was formed, which made quick progress on specifying
a 16-bit character set. The surrogate characters were built in in case
more than 2^16 characters came along, and for giving people plenty of
room to experiment themselves in the "private areas" there. Meanwhile,
ISO-10646 and Unicode converged: ISO limited the charset to 0x110000
characters, which should be enough for everyone, and Unicode dropped
the "16 bit charset" notation, they just define codepoints.
Unfortunately for them, they can't undo the surrogate character mess
and other wicked problems they would now like to get rid of (singletons,
certain compatibility characters, some presentation forms, ligatures).

I'm very ignorant on these things, I must admit! thanks for sharing.

A Java "char" variable can't hold non-baseplane Unicode characters, but
Java strings can. For Sun JVMs, they are basically UTF-16 encoded
Unicode strings. BTW there are JVMs out there which use UTF-8 in
Java Strings, the same way strings are stored in class files.
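To make the char versus String distinction concrete, here is a small sketch. It assumes the code point methods that were added to java.lang in Java 5 (Character.toChars, String.codePointCount), which postdate this thread; the class name CharVersusCodePoint is invented for the example:

```java
class CharVersusCodePoint {
    // One non-baseplane character, U+1D49C, built from its code point.
    static final String SCRIPT_A = new String(Character.toChars(0x1D49C));

    // Number of 16-bit char units the String occupies.
    static int charCount() {
        return SCRIPT_A.length();
    }

    // Number of Unicode code points those units represent.
    static int codePointCount() {
        return SCRIPT_A.codePointCount(0, SCRIPT_A.length());
    }

    public static void main(String[] args) {
        System.out.println("chars:       " + charCount());
        System.out.println("code points: " + codePointCount());
        // Each individual char is only half of the character.
        System.out.println("first char is a high surrogate: "
                + Character.isHighSurrogate(SCRIPT_A.charAt(0)));
    }
}
```

The String holds the character fine; it is length() and charAt() that count and return surrogate halves.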

The point is of course: can the run time libraries handle non-baseplane
characters?

It's even worse! Is javac able to handle UTF-16 encoded files? If so, would it be able to do:

String nonBaseplaneString = "... some non-baseplane chars here ";

and what would be the use of this, if I can't guarantee that

new String(nonBaseplaneString.toCharArray()).equals(nonBaseplaneString)

will yield true all the time?
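For what it's worth, that particular round trip can be checked directly. toCharArray() copies the 16-bit units as they are, surrogates included, so nothing is lost there; trouble starts when a string containing an unpaired surrogate goes through a byte encoding. A sketch, assuming Java 7 or later for StandardCharsets; the class name RoundTrip and the sample string are made up:

```java
import java.nio.charset.StandardCharsets;

class RoundTrip {
    // "A", then U+1D49C written as its surrogate pair, then "B".
    static final String NON_BASEPLANE = "A\uD835\uDC9CB";

    // The round trip asked about above: String -> char[] -> String.
    static boolean viaCharArray(String s) {
        return new String(s.toCharArray()).equals(s);
    }

    // A round trip through UTF-8 bytes, for comparison.
    static boolean viaUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(s);
    }

    public static void main(String[] args) {
        System.out.println("char[] round trip:        " + viaCharArray(NON_BASEPLANE));
        System.out.println("UTF-8 round trip:         " + viaUtf8(NON_BASEPLANE));
        // A lone high surrogate is not a character; the encoder replaces it.
        System.out.println("UTF-8 with lone surrogate: " + viaUtf8("\uD835"));
    }
}
```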

The java.text.BreakIterator can, but that's no magic. I
have no idea whether for example AWT display routines can display non-
baseplane characters, mainly because I've yet to get an appropriate
font. The TTF unicode mapping tables allocate, lo and behold, 16 bits
for the character. Who's complaining about Java?
BTW Mozilla can't deal with non-baseplane characters either, to the
chagrin of the MathML folks who use them for mathematical presentation
forms. Guess what's the main reason, besides fonts: C's wchar_t is 16 bits
too, at least on some platforms (Windows, notably).

well, to be honest, I thought as well that moving from 8 bits to 16 bits for address space would have solved all our issues with chars once and for all... so I don't feel like blaming them for not having thought of more complex issues :-/

now, if you thought you could take the characters() SAX event and create a String out of it and do something useful with it (like print it, for example), forget it. The result will very likely not be the one you expect.
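One defensive pattern here, sketched under assumptions: SAX allows a parser to deliver contiguous text in several characters() calls, and I know of no guarantee that a surrogate pair never straddles two of them, so the sketch buffers everything and builds the String once at the end. The class name SaxTextBuffer and the helper textOf are hypothetical; the parser is whatever JAXP implementation ships with the JDK:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTextBuffer {
    // Collects all character data of a document into a single String.
    static String textOf(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Append the raw chars; do not build a String per event.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // &#x1D49C; is a character reference to a non-baseplane character.
        String s = textOf("<r>a&#x1D49C;b</r>");
        System.out.println("chars:       " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
    }
}
```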

That's an interesting observation. I never had problems in this area.
But this may have something to do with the fact that I never went out of
the Unicode baseplane with my chars.

Yeah, nobody ever did (this came out after testing Slide for webdav compliance)... but I have the feeling this will bite us in the back in the future.

Another reason not to use Stings at all.
Stings are bad, of course :-)

:-)

Strings are another matter. In fact, Strings should be preferred over
char arrays because they can hide the actual representation of the Unicode
strings.

Very true! Missed that.

If you use character arrays, you have to deal with surrogate
character pairs yourself. A substring() could be implemented to deal
with non-baseplane characters correctly. Of course, Java was invented
when people thought of Unicode as a 16-bit charset, and the standardized
behaviour is that the String methods operate on the internal char array.
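A substring that respects surrogate pairs could look roughly like the following. This is only a sketch: substringByCodePoints is a hypothetical helper, not a JDK method, and it leans on String.offsetByCodePoints, which arrived in Java 5, after this thread:

```java
class CodePointSubstring {
    // Hypothetical helper: like substring(), but begin and end count
    // code points instead of 16-bit char units.
    static String substringByCodePoints(String s, int begin, int end) {
        int from = s.offsetByCodePoints(0, begin);        // char index of begin
        int to = s.offsetByCodePoints(from, end - begin); // char index of end
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        String s = "A\uD835\uDC9CB"; // "A", U+1D49C, "B"

        // Plain substring() cuts the pair and returns a lone high surrogate.
        String broken = s.substring(1, 2);
        System.out.println("substring(1, 2) is a lone surrogate: "
                + Character.isHighSurrogate(broken.charAt(0)));

        // The code point version returns the whole character (two chars).
        String whole = substringByCodePoints(s, 1, 2);
        System.out.println("code point substring length in chars: " + whole.length());
    }
}
```

The plain String methods keep operating on char indices, so a helper like this has to be applied deliberately wherever indices come from user-visible character counts.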

talking about a mess :-(

--
Stefano.
