From: "Doug Ewell" <[EMAIL PROTECTED]>
How does Java indicate the end of a string?  It can't use the value
U+0000, as C does, because the "modified UTF-8" sequence C0 80 still
gets translated as U+0000.  And if the answer is that Java uses a length
count, and therefore doesn't care about zero bytes, then why is there a
need to encode U+0000 specially?

You seem to assume that Java (the language) uses this sequence. In fact the sequence is not used by Java itself, but only in its interfaces with other languages, including C.
In 100% pure Java programming, you never see that sequence; you just work with UCS-2 code units when parsing "String" instances or comparing "char" values.
And if you perform I/O using the supported "UTF-8" Charset instance, Java properly encodes U+0000 as a single null byte. So why do people think that UTF-8 support in Java is broken? It is not.
The "modified UTF-8" encoding is only for use in the serialization of compiled classes that contain a constant string pool, and through the JNI interface to C-written modules using the legacy *UTF() APIs that want to work with C strings.


There's no requirement to use that legacy *UTF() interface in C, because you can also use the UTF-16 interface, which does not require the Java VM to allocate a byte buffer and convert the internal String storage. The UTF-16 JNI interface is more efficient, just a bit more complex to handle in C when you only use the standard char-based C library; if you use the wchar_t-based C library, you don't need this legacy interface. However, support for wchar_t in standard libraries is not guaranteed on all platforms, even when a wchar_t datatype exists in the C libraries and headers, because ANSI C allows "wchar_t" to be defined as equal to "char", i.e. only 1 byte. If that is the case, the C program will need to use something other than "wchar_t", for example "unsigned short", provided it is at least 16 bits.
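As a rough sketch of that UTF-16 interface (again with made-up "Example" class and method names), the native side reads the 16-bit jchar units directly, with the length obtained separately, and no byte buffer or UTF-8 conversion is involved:

    #include <jni.h>

    /* Hypothetical native method using the UTF-16 interface: the VM hands back
       16-bit jchar units directly, with the length given separately, so no
       byte buffer or UTF-8 conversion is needed. */
    JNIEXPORT jint JNICALL
    Java_Example_countSpaces(JNIEnv *env, jclass cls, jstring s)
    {
        jsize len = (*env)->GetStringLength(env, s);  /* length in 16-bit units */
        const jchar *units = (*env)->GetStringChars(env, s, NULL);
        jsize i;
        jint count = 0;
        if (units == NULL)
            return 0;
        for (i = 0; i < len; i++) {
            if (units[i] == 0x0020)       /* jchar is an unsigned 16-bit unit */
                count++;
        }
        (*env)->ReleaseStringChars(env, s, units);
        return count;
    }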

On Windows, Unix, Linux and Mac OS or Mac OS X, modern C compilers support wchar_t with at least 16 bits, so this is not a problem. It may still be a problem if wchar_t is 32 bits, because the fastest UTF-16-based JNI interface requires 16-bit code units: in that case you won't be able to use wcslen() and so on...
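If you do have to support a platform where wchar_t is 32 bits, a tiny replacement for wcslen() over jchar units is easy to write; this hypothetical helper is just one way to do it:

    #include <jni.h>

    /* If wchar_t is 32 bits (as in most Unix/Linux C libraries), wcslen() and
       friends cannot be applied to jchar buffers; a small replacement such as
       this hypothetical helper is needed instead. */
    static size_t jchar_len(const jchar *units)
    {
        size_t n = 0;
        while (units[n] != 0)
            n++;
        return n;    /* same contract as wcslen(), but over 16-bit jchar units */
    }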

With the fastest 16-bit JNI interface, note that "wide-string" C libraries assume that U+0000, coded as a single null wchar_t code unit, is an end-of-string terminator; so if this is an issue and your external C-written JNI component must work with any Java String instance, you'll need to use memcpy() and similar functions with a separate length indicator to access all the characters of a Java String instance. That complication is not always necessary and sometimes causes unneeded errors. Using the legacy *UTF() JNI interface avoids this security risk or interpretation issue...
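A possible sketch of that memcpy()-based approach, assuming a hypothetical helper whose buffer the caller frees: it copies every code unit, including any embedded U+0000, and returns the length separately instead of relying on a terminator:

    #include <jni.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical helper: copy every code unit of a Java String, including
       any embedded U+0000, into a malloc()ed buffer the caller frees; the
       length is returned separately instead of relying on a terminator. */
    static jchar *copy_string_units(JNIEnv *env, jstring s, jsize *out_len)
    {
        jsize len = (*env)->GetStringLength(env, s);
        const jchar *units = (*env)->GetStringChars(env, s, NULL);
        jchar *copy;
        if (units == NULL)
            return NULL;
        copy = malloc((size_t)len * sizeof(jchar));
        if (copy != NULL)
            memcpy(copy, units, (size_t)len * sizeof(jchar)); /* wcscpy() would stop at U+0000 */
        (*env)->ReleaseStringChars(env, s, units);
        *out_len = len;
        return copy;
    }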
