----- Original Message ----- From: "John Cowan" <[EMAIL PROTECTED]>
To: "Doug Ewell" <[EMAIL PROTECTED]>
Cc: "Unicode Mailing List" <[EMAIL PROTECTED]>; "Philippe Verdy" <[EMAIL PROTECTED]>; "Peter Kirk" <[EMAIL PROTECTED]>
Sent: Monday, November 15, 2004 7:05 AM
Subject: Re: U+0000 in C strings (was: Re: Opinions on this Java URL?)



Doug Ewell scripsit:

As soon as you can think of one, let me know.  I can think of plenty of
*binary* protocols that require zero bytes, but no *text* protocols.

Most languages other than C define a string as a sequence of characters
rather than a sequence of non-null characters. The repertoire of characters
that can exist in strings usually has a lower bound, but its full magnitude
is implementation-specific. In Java, exceptionally, the repertoire is
defined by the standard rather than the implementation, and it includes
U+0000. In any case, I can think of no language other than C which does
not support strings containing U+0000 in most implementations.

It is exactly this inclusion of U+0000 as a valid character in Java strings that requires the character to be preserved across the JNI interface and in String serializations.


Some here consider this a broken behavior, but there is no other simple way to represent this character when passing a Java String instance to and from the JNI interface, or through serialization such as in class files.

My opinion is that the Java behavior does not define a new encoding; rather, it is a transfer encoding syntax (TES) that lets String instances be serialized effectively. String instances are encoded as sequences of the 16-bit "char" Java datatype (essentially UCS-2), not the stricter UTF-16 restriction of UCS-2 that requires paired surrogates; nor are the '\uFFFF' and '\uFFFE' code units illegal, as they simply map to the U+FFFF and U+FFFE code points, even though those code points are permanently assigned as noncharacters in Unicode and ISO/IEC 10646.
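To make this concrete, here is a small self-contained example (class and variable names are mine) of the same "modified UTF-8" transfer encoding syntax as used by DataOutputStream.writeUTF() and by class-file string constants: the embedded U+0000 is written as the two-byte sequence C0 80 rather than a single zero byte, so the serialized form never contains a 0x00 byte and the character survives a round trip:

    import java.io.*;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF("a\u0000b"); // String containing U+0000

            // Dump the serialized bytes: 00 04 (length), then 61 C0 80 62.
            for (byte b : buf.toByteArray())
                System.out.printf("%02X ", b);
            System.out.println();

            // Round trip: the embedded U+0000 is preserved.
            String back = new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray())).readUTF();
            System.out.println(back.equals("a\u0000b")); // true
        }
    }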

The internal working storage of Java Strings is not a character set (CCS or CES), and these strings are not necessarily bound to Unicode (even though Java provides many Unicode-based character properties and charset-conversion libraries): they can just as well store other charsets, using encoding/decoding libraries other than those found in the java.io.* and java.text.* packages. Once you admit that, Java String instances are simply arrays of code units, not arrays of code points, and their interpretation as encoded characters is left to other layers.
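As a quick illustration (the class name is mine), nothing in the String class itself prevents storing U+0000, the noncharacter code units, or even an unpaired surrogate; the content is only checked when some other layer tries to interpret or convert it:

    public class CodeUnitDemo {
        public static void main(String[] args) {
            // A String is just a sequence of 16-bit char code units;
            // no validation is applied when it is constructed.
            String s = "\u0000\uFFFE\uFFFF\uD800"; // NULL, noncharacters, lone surrogate
            System.out.println(s.length());                        // 4 code units
            System.out.println((int) s.charAt(0));                 // 0
            System.out.println(Integer.toHexString(s.charAt(3)));  // d800
        }
    }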

Should there ever be a successor to Unicode (or should a Chinese implementation prefer to handle String instances internally as GB18030), with different mappings from code units to code points and characters, the working model of Java String instances and the "char" datatype would not be affected. This would still conform to the Java specifications, provided the standard java.text.*, java.io.*, and java.nio.* packages that perform the various mappings between code units, code points, characters, and byte streams are left unmodified: new alternate packages could be used without changing the String object or the unsigned 16-bit integer "char" datatype.

In Java 1.5, Sun chose to support supplementary characters without changing the char and String representations. The "Character" class was extended with static methods that represent code points as 32-bit "int" values and that map any Unicode code point in the 17 planes to and from "char" code units. The String class was likewise extended so that "char"-encoded strings can be parsed by "int" code points (with automatic detection of surrogate pairs), while the legacy interface was preserved. ICU4J's "UCharacter" class similarly works on code points directly as "int" values, unlike "Character", whose instances still store only a single 16-bit "char" and which offers only static support for code points: there is still no "Character(int codepoint)" constructor, only "Character(char codeunit)", because "Character" keeps its existing serialization for compatibility and remains bound to the 16-bit "char" datatype for object boxing (automatic boxing exists only in Java 1.5; explicit boxing in previous and current versions is still supported).
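A short example of the Java 1.5 additions (the class name is mine; the calls are the standard java.lang ones): a supplementary code point is stored as two "char" code units, and the new code-point methods on String and Character combine the surrogate pair automatically:

    public class CodePointDemo {
        public static void main(String[] args) {
            // U+10400 is a supplementary code point, stored as a surrogate pair.
            String s = new String(Character.toChars(0x10400)) + "A";

            System.out.println(s.length());                         // 3 char code units
            System.out.println(s.codePointCount(0, s.length()));    // 2 code points

            // codePointAt() detects and combines the surrogate pair.
            System.out.println(Integer.toHexString(s.codePointAt(0))); // 10400
            System.out.println(Character.charCount(s.codePointAt(0))); // 2
        }
    }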

If Java needs any further extension, it would be to include something like ICU4J's "UCharacter" class, allowing a code point to be stored as a 32-bit "int" and to be built from a "char"-coded surrogate pair of code units or from a "Character" instance; and also to add a "UString" class that internally uses arrays of "int"-coded code units, with converters between String and UString. Such an extension would not need any change in the JVM, just new supported packages.
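Such a "UString" does not exist in the standard libraries or in ICU4J; the following is only a rough sketch of what the suggested class might look like, using an int[] of code points internally and converting to and from the ordinary char-based String:

    // Hypothetical sketch only: this class is not part of Java or ICU4J.
    public final class UString {
        private final int[] codePoints; // one 32-bit code unit per code point

        private UString(int[] codePoints) {
            this.codePoints = codePoints;
        }

        // Converter from a char-based String; surrogate pairs are combined.
        public static UString fromString(String s) {
            int[] cps = new int[s.codePointCount(0, s.length())];
            for (int i = 0, j = 0; i < s.length(); j++) {
                int cp = s.codePointAt(i);
                cps[j] = cp;
                i += Character.charCount(cp);
            }
            return new UString(cps);
        }

        // Converter back to a char-based String (reintroduces surrogate pairs).
        public String toJavaString() {
            return new String(codePoints, 0, codePoints.length);
        }

        public int length()           { return codePoints.length; }
        public int codePointAt(int i) { return codePoints[i]; }
    }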

But even with all these extensions, the U+0000 Unicode character would remain valid and supported, and there would still be a need to support it in JNI and in internal JVM serializations of String instances. I really don't like the idea, advanced by some people here, of deprecating a widely used JNI interface that needs this special serialization when interfacing with C code.

Also, the fact that C assigns the all-bits-zero char the role of end-of-string terminator does not imply that C is unable to represent the NULL character, given that all other "char" values have *no* required semantics or values (for example, '\r' and '\n' are not bound to fixed Unicode characters but to functions, and their interpretation remains compiler- and platform-specific).



