Re: Windows only bug: UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence

Jeremy Carroll Wed, 27 Apr 2005 05:23:53 -0700

Graham Leggett wrote:

Michael Glavassevich wrote:
This is neither a Xerces bug nor a bug in Java.
Investigating this further, this is definitely a bug in Java which is then not trapped by Xerces - the FileWriter opens the file with an unpredictable encoding different per platform, which is definitely broken behaviour. Xerces then silently allows this problem to remain unchecked through encoding, when in reality it should have thrown an exception.

Rubbish. It is the documented behaviour. It is well motivated; it enables the Java app to talk with the OS and other apps in the expected encoding. It just isn't appropriate for web apps. What is broken is using a FileWriter in code intended for a Web application. That is not the intended purpose of FileWriters. FileOutputStreams are appropriate for this purpose (potentially wrapped with a UTF-8 OutputStreamWriter).
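To illustrate the wrapping suggested above, here is a minimal sketch. The class name Utf8WriterSketch, the helper names and the sample element are made up for illustration and come from neither Xerces nor the original code; it also uses StandardCharsets, which is newer than the JDK discussed in this thread.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8WriterSketch {

    // Wrap a byte stream in a Writer whose encoding is fixed to UTF-8,
    // rather than relying on the platform default charset.
    static Writer utf8Writer(OutputStream out) {
        return new OutputStreamWriter(out, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: write a tiny XML document whose declared
    // encoding matches the encoding the Writer really uses.
    static void writeSample(OutputStream out) throws IOException {
        try (Writer w = utf8Writer(out)) {
            w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            w.write("<name>caf\u00e9</name>\n");
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the real output location.
        File target = File.createTempFile("utf8-sketch", ".xml");
        try (OutputStream out = new FileOutputStream(target)) {
            writeSample(out);
        }
        System.out.println("Wrote " + target.getAbsolutePath());
    }
}
```

With this arrangement the non-ASCII character is written as a two-byte UTF-8 sequence on every platform, so a parser that trusts the XML declaration reads the file back correctly.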

When you pass the serializer a Writer rather than an OutputStream it will write characters not bytes. If the Writer is writing to an OutputStream, it is responsible for encoding the characters and will do whatever it does regardless of what encoding you specified on the serializer.
Which in turn produces broken XML. What Xerces should be doing is testing the encoding of the underlying Writer, and if it is different from the encoding specified for serialisation, it should throw an exception and fail safely, rather than quietly continuing to render broken output.

This is the behaviour I put in my code for output (not in Xerces), but it is not robust, because you cannot always tell the encoding of a Writer, and indeed some, e.g. a StringWriter, do not have an encoding.

A possible behaviour is:
- if the Writer is an OutputStreamWriter, find its encoding and use java.nio.charset.Charset to convert this to the IANA canonical name (if any)
- issue a warning if the canonical name
  a) begins with x-, or
  b) begins with Mac,
  since these names are not registered with IANA and hence are unlikely to be interoperable
- add an XML declaration with the canonical name as the encoding
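The steps above could be sketched roughly as follows. This is not Xerces code; the class and method names are invented for illustration, and it assumes that the name returned by Charset.name() is close enough to the IANA name for this purpose.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WriterEncodingSketch {

    // Returns the canonical charset name of the Writer, or null when the
    // Writer exposes no encoding (e.g. a StringWriter) or is already closed.
    static String canonicalEncoding(Writer w) {
        if (!(w instanceof OutputStreamWriter)) {
            return null;
        }
        // getEncoding() may return a historical name such as "ISO8859_1".
        String historical = ((OutputStreamWriter) w).getEncoding();
        if (historical == null) {
            return null;
        }
        // Charset.forName accepts aliases; name() gives the canonical form.
        return Charset.forName(historical).name();
    }

    // Heuristic from the list above: names starting with "x-" or "Mac"
    // are treated as unlikely to be registered with IANA.
    static boolean isLikelyUnregistered(String canonicalName) {
        return canonicalName.startsWith("x-")
                || canonicalName.startsWith("X-")
                || canonicalName.startsWith("Mac");
    }

    // Builds an XML declaration that names the given encoding.
    static String xmlDeclaration(String canonicalName) {
        return "<?xml version=\"1.0\" encoding=\"" + canonicalName + "\"?>";
    }

    public static void main(String[] args) {
        Writer[] writers = {
            new OutputStreamWriter(new ByteArrayOutputStream(), StandardCharsets.ISO_8859_1),
            new StringWriter()
        };
        for (Writer w : writers) {
            String name = canonicalEncoding(w);
            if (name == null) {
                System.out.println("No encoding available for " + w.getClass().getName());
                continue;
            }
            if (isLikelyUnregistered(name)) {
                System.err.println("Warning: " + name + " may not be interoperable");
            }
            System.out.println(xmlDeclaration(name));
        }
    }
}
```

Charset also offers isRegistered(), which may be a more direct test than the prefix check, though it only reports what the charset implementer declared. The StringWriter case shows why the check cannot be made robust: there is simply nothing to inspect.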

But always, always, always, Web output is better in Unicode, e.g. UTF-8. If the application passes you a Writer in some other encoding then the library writer is going to lose whatever they do.


Jeremy

http://java.sun.com/j2se/1.4.2/docs/api/java/io/OutputStreamWriter.html#getEncoding()

Thank you for pointing me in the right direction for solving this - it had us stumped for ages.
Regards,
Graham
--

