Re: Windows only bug: UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence

Jeremy Carroll Thu, 28 Apr 2005 02:47:37 -0700

Graham Leggett wrote:

Jeremy Carroll wrote:
Rubbish. It is the documented behaviour. It is well motivated; it enables the Java app to talk with the OS and other apps in the expected encoding.
Windows and Java are both Unicode native systems,

No. The default platform encoding on Windows systems varies, and tends to be a windows-XXXX encoding. Windows supports Unicode, but that doesn't make it the default encoding.

yet Java writes Latin-1 by default on Windows.


It doesn't. It depends on the locale settings on your Windows box.
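To see what a particular box actually does, a small sketch along these lines may help. The class and method names are made up for illustration. It prints the JVM's default charset and compares how many bytes the same text takes in two encodings. Note that on JDK 18 and later the default is UTF-8 unless overridden, so the locale-dependent behaviour discussed here applies to older JVMs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library discussed in this thread.
final class DefaultEncodingDemo {

    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String sample = "Stra\u00dfe"; // contains one non-ASCII character

        // This is the charset a FileWriter built without an explicit
        // encoding will use on this JVM.
        Charset platform = Charset.defaultCharset();
        System.out.println("default charset: " + platform);

        System.out.println("bytes in default charset: " + encodedLength(sample, platform));
        System.out.println("bytes in ISO-8859-1:      " + encodedLength(sample, StandardCharsets.ISO_8859_1));
        System.out.println("bytes in UTF-8:           " + encodedLength(sample, StandardCharsets.UTF_8));
    }
}
```

If the first count differs from the UTF-8 count, anything written through the default encoding will not be valid UTF-8 for that text.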

Looks like a bug to me, regardless of how well documented it is (I find no documentation on it anywhere, we worked out this behaviour through trial and error).


In java.io.FileWriter

"The constructors of this class assume that the default character encoding and the default byte-buffer size are acceptable."

But regardless, Xerces lets this situation through without throwing warning or error, and here lies the problem.

In the java.io.OutputStreamWriter javadoc

"For top efficiency, consider wrapping an OutputStreamWriter within a BufferedWriter"

In such a case, Xerces cannot detect the error.
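A rough sketch of why the wrapped case is opaque. This is not Xerces code; the probe method below is a hypothetical check of the kind a library could attempt. An OutputStreamWriter reports its encoding through getEncoding(), but once it is wrapped in a BufferedWriter the receiver only sees a plain Writer.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical probe, for illustration only.
final class WriterEncodingProbe {

    // Returns the encoding name only when the Writer exposes it directly.
    static Optional<String> visibleEncoding(Writer writer) {
        if (writer instanceof OutputStreamWriter) {
            // FileWriter is a subclass of OutputStreamWriter, so it is covered too.
            return Optional.ofNullable(((OutputStreamWriter) writer).getEncoding());
        }
        // BufferedWriter, StringWriter, etc. give no hint about any encoding.
        return Optional.empty();
    }

    public static void main(String[] args) {
        OutputStreamWriter direct =
                new OutputStreamWriter(new ByteArrayOutputStream(), StandardCharsets.ISO_8859_1);
        Writer wrapped = new BufferedWriter(direct);

        System.out.println("direct:  " + visibleEncoding(direct));
        System.out.println("wrapped: " + visibleEncoding(wrapped));
    }
}
```

The second probe comes back empty even though the bytes will still be written in the wrapped writer's encoding, which is why the check cannot be applied uniformly.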

It is a plausible decision with an XML output routine to:
a) support the use of Writers
b) not attempt to detect encoding errors on use of Writers, because they cannot be detected uniformly

It would only be a bug, in my view, if OutputStream was not supported.

It just isn't appropriate for WebApps. What is broken is using a FileWriter in code intended for a Web application.
The code is not being used by a web application. It is being used to persist data to disk in a system backend.

Hmmm, I personally think it is worth seriously discouraging the use of Writers for XML output, but there is a time and place for them.

A Writer is a mechanism for writing text data (in my case to disk but that's not important). XML is text data. A Writer is a perfectly logical choice to use in this case.

It's the wrong choice. I have made that mistake too, but it is a mistake. Building an OutputStreamWriter with an explicit "utf8" or "utf16" encoding would be sensible for XML, but correct use of Writers requires some understanding of character encodings, and an appreciation that XML is best written in UTF-8. A FileWriter is therefore a poor choice, because it assumes that the default character encoding (which in general is unknown and different from UTF-8) is acceptable.
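A minimal sketch of the difference, using in-memory streams instead of files so it runs anywhere. The class and method names are invented for the example. The same XML text, with no encoding declaration, is written once through a UTF-8 OutputStreamWriter and once through a Latin-1 one (standing in for a FileWriter on a box whose default is Latin-1), and the resulting bytes are then handed to the JDK's XML parser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical demo class, for illustration only.
final class XmlWriterEncodingDemo {

    // No XML declaration, so a parser reading the bytes will assume UTF-8.
    static final String XML = "<note>Stra\u00dfe</note>";

    // Writes the text through a Writer with an explicit charset.
    static byte[] write(String xml, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(xml);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // True if the bytes parse as a well-formed XML document.
    static boolean parses(byte[] bytes) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return true;
        } catch (Exception e) {
            // Malformed byte sequences surface here as a parse or I/O failure.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 writer parses:   " + parses(write(XML, StandardCharsets.UTF_8)));
        System.out.println("Latin-1 writer parses: " + parses(write(XML, StandardCharsets.ISO_8859_1)));
    }
}
```

The writing side succeeds in both cases; the problem only shows up later, when something reads the bytes back as XML.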

It's your bug, take some responsibility and stop trying to blame someone else. When this bug hit me, it was a difficult bug to understand because of the need to understand character encoding issues ... but that didn't make it not my problem.

Similarly, use of a FileReader to read XML is an error; using an InputStream is much better, except in special circumstances.

(Try turning the above para around for input:


> A Reader is a mechanism for reading text data (in my case from disk but
> that's not important). XML is text data. A Reader is a perfectly logical
> choice to use in this case.

but completely wrong for reading an XML file off a disk. The XML file is in an a priori unknown encoding, which it declares in its first line, following the XML Rec. A Reader has a decoder for some charset in it, and that charset may not be the one in the encoding declaration. Reading XML using a FileReader is an accident waiting to happen and is always a bug. Because the library that I support is largely used in experimental code, I put in error detection support for such cases. For a production library it is not necessarily the wrong decision to skip detection of an error that cannot be detected uniformly and is essentially an application coding error rather than a runtime problem.)
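The input side can be sketched the same way. Names here are again hypothetical. The document below declares ISO-8859-1 and is stored in that encoding. Given the raw bytes, the parser reads the declaration and decodes correctly. Given a Reader, the parser takes the characters as already decoded and ignores the declaration, so a Reader built with the wrong charset quietly produces the wrong text.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, for illustration only.
final class XmlReadEncodingDemo {

    static final String TEXT = "Stra\u00dfe";

    // A document that declares its encoding and is stored in that encoding.
    static final byte[] DOC =
            ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><note>" + TEXT + "</note>")
                    .getBytes(StandardCharsets.ISO_8859_1);

    private static DocumentBuilder builder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    // Byte stream: the parser honours the encoding declaration.
    static String textViaStream(byte[] bytes) {
        try {
            Document doc = builder().parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Character stream: decoding has already happened in the Reader,
    // so the declaration in the document is disregarded.
    static String textViaReader(byte[] bytes, Charset readerCharset) {
        try {
            Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), readerCharset);
            Document doc = builder().parse(new InputSource(reader));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("stream matches original: "
                + TEXT.equals(textViaStream(DOC)));
        System.out.println("reader (UTF-8) matches original: "
                + TEXT.equals(textViaReader(DOC, StandardCharsets.UTF_8)));
    }
}
```

Here the Reader's charset is fixed at UTF-8 so the result is repeatable; with a FileReader it would be whatever the platform default happens to be.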

Regards,
Graham
--


Jeremy


