Hi all,
I just thought I'd post a quick solution to a
problem I struggled with today. (This should, perhaps, be added to the Javadoc
for the Marshaller class.)
If you ever run into a UTFDataFormatException (with a message like
"invalid byte 1 of 1-byte UTF-8 sequence (0x96)"), your OutputStreamWriter's
character encoding probably doesn't match the encoding declared in your XML
document. This is an easy situation to end up in if you are coding on a Windows
machine, because the default character encoding for the Windows JVM is "Cp1252"
(ANSI/Windows-1252) while the default Castor encoding is "UTF-8". It is an
especially hard problem to debug because it only shows up when certain
Windows-specific characters are used. Nearly all of the characters in a typical
XML document are in the lower (ASCII) range, and those have the same byte
encoding under both ANSI and UTF-8. That is, A-Z, a-z, 0-9 and the common
punctuation characters produce exactly the same bytes whether they are encoded
using ANSI or UTF-8.
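However, the "en dash" (ANSI byte encoding 0x96) is an example of an ANSI
character which maps to a byte that cannot possibly occur on its own in a legal
UTF-8 encoded document: in UTF-8, 0x96 is only valid as a continuation byte
inside a multi-byte sequence, never as the first byte of a character. If an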
XML document, then, is generated using the ANSI byte encoding, while the
document itself states that the encoding is UTF-8, the Unmarshaller (well, the
parser, actually) will choke on the bad byte (0x96) because the
Unmarshaller follows the XML document's instructions, which tell it to
decode assuming the byte encoding is UTF-8.
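
To make the byte-level picture concrete, here is a small JDK-only sketch (no
Castor involved). The class name EnDashDemo and the helper isValidUtf8 are made
up for illustration, and it assumes your JDK ships the "windows-1252" charset,
which is typical but not guaranteed by the Java spec. It shows that plain ASCII
text produces identical bytes under both encodings, while the en dash becomes
the single byte 0x96 under Cp1252, which a strict UTF-8 decoder rejects.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EnDashDemo {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of silently
            // substituting a replacement character.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: this charset is available in the running JDK.
        Charset cp1252 = Charset.forName("windows-1252");

        // ASCII-only text: same bytes under both encodings.
        String ascii = "Castor 123";
        System.out.println(Arrays.equals(
                ascii.getBytes(cp1252),
                ascii.getBytes(StandardCharsets.UTF_8)));   // true

        // The en dash (U+2013) is a single byte under Cp1252.
        byte[] ansiBytes = "\u2013".getBytes(cp1252);
        System.out.println(String.format("0x%02X", ansiBytes[0] & 0xFF)); // 0x96

        // That lone byte is not legal UTF-8, which is what the parser trips on.
        System.out.println(isValidUtf8(ansiBytes));          // false
    }
}
```

This is why the problem stays hidden until the first non-ASCII character shows
up in the data.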
The solution is simple:
** Always make sure your OutputStreamWriter is encoding characters using the
same encoding as specified in the XML document. That is, do something like this:

    public static final String CHARSET = "UTF-8";
    // NOTE: the Castor-generated XML document's encoding *must*
    // match the encoding scheme the OutputStreamWriter uses
    // to generate the XML document
    ...
    Marshaller marshaller = new Marshaller(
            new OutputStreamWriter(xmlOutputStream, CHARSET));
    marshaller.setEncoding(CHARSET);

(With your chosen character set. UTF-8 is a good choice because all parsers
should support this encoding.)
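
If you want to see the effect end to end without setting up Castor, the
following is a rough sketch that uses only the JDK's own XML parser. The class
name EncodingRoundTrip, the <note> element and the sample text are all invented
for illustration, and the hand-written XML declaration merely stands in for
what the Marshaller would emit. It writes the same document twice, once with a
writer whose encoding matches the declaration and once with a Cp1252 writer
(again assuming "windows-1252" is available in your JDK), then tries to parse
each result.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class EncodingRoundTrip {

    // Encoding named in the XML declaration (hypothetical document).
    static final String DECLARED = "UTF-8";
    // Sample text containing an en dash (U+2013).
    static final String TEXT = "pages 10\u201320";

    /** Writes the document using the given writer encoding. */
    static byte[] write(String writerCharset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, writerCharset)) {
            w.write("<?xml version=\"1.0\" encoding=\"" + DECLARED + "\"?>");
            w.write("<note>" + TEXT + "</note>");
        }
        return out.toByteArray();
    }

    /** Returns true if the bytes parse and the text survives intact. */
    static boolean parsesCleanly(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return TEXT.equals(doc.getDocumentElement().getTextContent());
        } catch (Exception e) {
            // The parser rejects the stray 0x96 byte.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Writer encoding matches the declaration.
        System.out.println(parsesCleanly(write("UTF-8")));
        // Writer encoding does not match the declaration.
        System.out.println(parsesCleanly(write("windows-1252")));
    }
}
```

The first call should succeed and the second should fail, which mirrors what
happens when the OutputStreamWriter handed to the Marshaller falls back to the
platform default on Windows. The parser may also log a fatal error to stderr
for the mismatched case.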
Anyhow, I hope this helps
someone.
-
James
