RE: NUL-transparent Java-UTF-8

Kent Karlsson Sun, 29 Dec 2002 07:02:56 -0800

Well, the format is specified in:

http://java.sun.com/j2se/1.4.1/docs/api/java/io/DataOutput.html#writeUTF(java.lang.String)


The argument is a Java String (i.e. a UTF-16 string), so "overlong (UTF-8-ish)
codes" in the input are not an issue (malformed UTF-16 is an issue, though).

Note that this function **ALSO** writes a two-byte length field at the
beginning of the output of each call to this method.  The overlong
NULL representation and the added length field are not the only
modifications; for surrogate code points the output is also "Oraclesque":
each surrogate code point is output as three bytes, with no pairing of
surrogates.  The method does NOT write a NULL (of any kind) at the end of
the string, and I have no idea why it modifies the representation of NULL.
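
To make the three modifications visible, here is a minimal sketch that dumps
the bytes writeUTF produces next to ordinary UTF-8 for the same String.  The
class name WriteUtfDemo, its helper methods and the sample string are my own
choices for illustration, not anything from the Java documentation.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;

final class WriteUtfDemo {
    /** Returns exactly the bytes that DataOutput.writeUTF emits for s. */
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(s);
        out.flush();
        return buf.toByteArray();
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample (hypothetical) input: 'A', NULL, and U+1D11E as a surrogate pair.
        String s = "A\u0000\uD834\uDD1E";
        System.out.println("writeUTF: " + hex(modifiedUtf8(s)));
        System.out.println("UTF-8:    " + hex(s.getBytes(Charset.forName("UTF-8"))));
    }
}
```

For this input the writeUTF line should read 00 09 41 C0 80 ED A0 B4 ED B4 9E
(length 9, overlong NULL, two unpaired three-byte surrogates), while the
UTF-8 line should read 41 00 F0 9D 84 9E.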

The proper conversions in Java to/from UTF-8 (from Strings, which are
UTF-16 in Java) are instead found via
java.nio.charset.Charset.forName(String charsetName),
which returns a "Charset" for UTF-8 if the charsetName parameter is "UTF-8".

This "Charset" can then be used to create reader and writer objects:
java.io.InputStreamReader(InputStream in, Charset cs) 
          Create an InputStreamReader that uses the given charset. 
java.io.OutputStreamWriter(OutputStream out, Charset cs) 
          Create an OutputStreamWriter that uses the given charset. 

(or use the shortcuts:
java.io.InputStreamReader(InputStream in,
                         String charsetName)
                  throws UnsupportedEncodingException
        Create an InputStreamReader that uses the named charset. 
java.io.OutputStreamWriter(OutputStream out,
                          String charsetName)
                   throws UnsupportedEncodingException
        Create an OutputStreamWriter that uses the named charset.)

These objects can then be used to read and write characters.

Etc, with BufferedReader/Writer on top of that...
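
A rough sketch of that reader/writer route, using in-memory streams so it is
self-contained.  It looks the charset up with java.nio.charset.Charset.forName,
the public lookup method; the class name Utf8StreamDemo and the encode/decode
helpers are made up for this example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

final class Utf8StreamDemo {
    static final Charset UTF8 = Charset.forName("UTF-8");

    /** Writes s through an OutputStreamWriter and returns the UTF-8 bytes. */
    static byte[] encode(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        Writer w = new BufferedWriter(new OutputStreamWriter(buf, UTF8));
        w.write(s);
        w.close(); // flushes the buffered characters into buf
        return buf.toByteArray();
    }

    /** Reads all characters from UTF-8 bytes through an InputStreamReader. */
    static String decode(byte[] bytes) throws IOException {
        Reader r = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), UTF8));
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) {
            sb.append((char) c);
        }
        r.close();
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String s = "A\u0000\uD834\uDD1E"; // 'A', NULL, U+1D11E
        byte[] bytes = encode(s);
        System.out.println(bytes.length + " bytes, round trip ok: "
                + s.equals(decode(bytes)));
    }
}
```

Here NULL stays a single 00 byte and the surrogate pair becomes one four-byte
sequence, so the sample string should come out as 6 bytes and round-trip
unchanged.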

The writeUTF and readUTF methods should **ONLY** be used for Java
object serialisation (for String object attributes), NOT for string I/O.
They do NOT convert to/from UTF-8, but something else, entirely
internal to Java.
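
One way to see that the two formats are not interchangeable: a strict UTF-8
decoder should reject both the overlong NULL and the unpaired three-byte
surrogates, while readUTF reads back what writeUTF wrote.  This is only a
sketch; StrictUtf8Check and its method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class StrictUtf8Check {
    /** True if bytes decode as UTF-8 with malformed input reported as an error. */
    static boolean isStrictUtf8(byte[] bytes) {
        try {
            Charset.forName("UTF-8").newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Writes s with writeUTF and reads it back with readUTF. */
    static String roundTrip(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new DataOutputStream(buf).writeUTF(s);
        DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        return in.readUTF();
    }

    public static void main(String[] args) throws IOException {
        byte[] overlongNull = {(byte) 0xC0, (byte) 0x80};
        byte[] loneSurrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0xB4};
        System.out.println("C0 80 is strict UTF-8:    " + isStrictUtf8(overlongNull));
        System.out.println("ED A0 B4 is strict UTF-8: " + isStrictUtf8(loneSurrogate));
        String s = "A\u0000\uD834\uDD1E";
        System.out.println("readUTF round trip ok:    " + s.equals(roundTrip(s)));
    }
}
```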

                Happy New 2003
                /kent k

PS
NULL **is** "just another UnicoXXXXX ISO 6421 control character".
It's C that misbehaves...



> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Markus Kuhn
> Sent: den 22 december 2002 14:32
> To: [EMAIL PROTECTED]
> Subject: Re: NUL-transparent Java-UTF-8
> 
> 
> Henry Spencer wrote on 2002-12-20 23:38 UTC:
> > It might be worth mention, because Java's not the only thing using it.
> > It's actually quite convenient to be able to make applications 
> > NUL-transparent without having to recode all the string operations.
> 
> Is there a proper full specification of this encoding somewhere
> online? Merely replacing 0x00 with its overlong UTF-8 equivalent
> 0xc0 0x80 can't be the full story, because what you are interested
> in the end must surely be binary transparency, not merely
> NUL-transparency. I don't see what NUL-transparency alone would
> be good for, as NUL is usually only a problem in arbitrary binary
> strings.
> 
> So you also have to specify how to represent any byte sequence
> including overlong UTF-8 sequences such as 0xc0 0x80. Until someone
> shows me the full spec behind this frequently quoted but unnamed
> Java derivative of UTF-8, I am not yet convinced that it is useful
> for anything in practice.
> 
> Markus
> 
> -- 
> Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
> Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>
> --
> Linux-UTF8:   i18n of Linux on all levels
> Archive:      http://mail.nl.linux.org/linux-utf8/
> 

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
