[castor-dev] Character Encodings

James Snook Tue, 03 Dec 2002 15:13:50 -0800

Hi all,

I just thought I'd post a quick solution to a problem I struggled with today. (This should, perhaps, be added to the Javadoc for the Marshaller class.)

If you ever run into UTFDataFormatExceptions (with messages like "invalid byte 1 of 1-byte UTF-8 sequence (0x96)"), your OutputStreamWriter's character encoding probably doesn't match the encoding declared in your XML document. It's easy to end up in this situation when coding on a Windows machine, because the default character encoding for the Windows JVM is typically "Cp1252" (ANSI/Windows-1252), while the default Castor encoding is "UTF-8". The problem is especially hard to debug because it only shows up when certain Windows-specific characters are used. The vast majority of the characters in a typical XML document are the lower (ASCII) characters, and those (A-Z, a-z, 0-9 and the common punctuation) have exactly the same byte encoding under ANSI as they do under UTF-8.

However, the "en dash" (ANSI byte encoding 0x96) is an example of an ANSI character that maps to a byte which cannot occur on its own in a legal UTF-8 document: in UTF-8, 0x96 is only valid as a continuation byte inside a multi-byte sequence. So if an XML document is generated using the ANSI byte encoding while the document itself declares its encoding as UTF-8, the Unmarshaller (well, the parser, actually) will choke on the bad byte (0x96), because it follows the document's declaration and decodes the bytes as UTF-8.
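To see the two cases side by side, here is a small sketch that uses only the JDK (no Castor). The class name EncodingMismatchDemo and its helper methods are made up for illustration, and it assumes your JVM provides the "windows-1252" charset, which standard JDKs normally do.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Castor.
public class EncodingMismatchDemo {
    static final Charset ANSI = Charset.forName("windows-1252");

    // True when the text produces identical bytes under Windows-1252 and UTF-8.
    public static boolean sameBytes(String text) {
        return Arrays.equals(text.getBytes(ANSI), text.getBytes(StandardCharsets.UTF_8));
    }

    // True when the bytes form well-formed UTF-8 (strict decoding, no replacement).
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Formats bytes as space-separated upper-case hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String enDash = "\u2013";
        byte[] ansiBytes = enDash.getBytes(ANSI);
        byte[] utf8Bytes = enDash.getBytes(StandardCharsets.UTF_8);

        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("ASCII same bytes:    " + sameBytes("<name>ABC 0123</name>")); // true
        System.out.println("en dash as ANSI:     " + hex(ansiBytes));  // 96
        System.out.println("en dash as UTF-8:    " + hex(utf8Bytes));  // E2 80 93
        System.out.println("ANSI bytes valid UTF-8? " + isValidUtf8(ansiBytes)); // false
    }
}
```

The first line of output is worth checking on your own machine, since that default is what an OutputStreamWriter uses when you don't pass an encoding.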

The solution is simple:

** Always make sure your OutputStreamWriter is encoding characters using the same encoding as specified in the XML document.

That is, do something like this:

// NOTE: the Castor-generated XML document's encoding *must* match the
// encoding scheme the OutputStreamWriter uses to generate the XML document
public static final String CHARSET = "UTF-8";
...

Marshaller marshaller = new Marshaller(new OutputStreamWriter(xmlOutputStream, CHARSET));
marshaller.setEncoding(CHARSET);

(Substitute your chosen character set for CHARSET. UTF-8 is a good choice because every conforming XML parser is required to support it.)
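If you want to reproduce the failure without Castor in the picture, something along these lines should do it. This is only a sketch using the JDK's built-in DOM parser. The class EncodingRoundTrip and its methods are hypothetical names, and the XML declaration is written by hand here to stand in for what the Marshaller would emit.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical demo class, not part of Castor.
public class EncodingRoundTrip {

    // Writes a tiny document that *declares* declaredEncoding, while the
    // bytes are actually produced by an OutputStreamWriter using writerCharset.
    public static byte[] write(String text, String declaredEncoding, Charset writerCharset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, writerCharset)) {
            w.write("<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?>");
            w.write("<note>" + text + "</note>");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Parses the bytes and returns the root element's text,
    // or null if the parser rejects the document.
    public static String tryRead(byte[] xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            System.out.println("Parse failed: " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        String text = "a\u2013b"; // contains an en dash
        Charset ansi = Charset.forName("windows-1252");

        // Writer and declaration agree: the text survives the round trip.
        System.out.println(text.equals(tryRead(write(text, "UTF-8", StandardCharsets.UTF_8))));

        // Declaration says UTF-8 but the writer used ANSI: the parser gives up.
        System.out.println(tryRead(write(text, "UTF-8", ansi)) == null);
    }
}
```

The second call is the trap described above: the declaration and the writer disagree, so the parser hits the lone 0x96 byte and fails. Passing the same CHARSET to both the OutputStreamWriter and setEncoding keeps the two in step.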

Anyhow, I hope this helps someone.

- James
