Nice explanation, James. I'll try to get this into the XML FAQ.

--K

> James Snook wrote:
> 
> Hi all,
> 
> I just thought I'd post a quick solution to a problem I struggled with
> today. (This should, perhaps, be added to the Javadoc for the
> Marshaller class.)
> 
> If you ever have problems involving UTFDataFormatExceptions (with
> messages like "invalid byte 1 of 1-byte UTF-8 sequence (0x96)"), your
> OutputStreamWriter's character encoding probably doesn't match your
> XML document's character encoding. This is a situation you may easily
> find yourself in if you are coding on a Windows machine because the
> default character encoding for the Windows JVM is "Cp1252"
> (ANSI/Windows-1252) while the default Castor encoding is "UTF-8". This
> is an especially hard problem to debug because you will only
> experience problems when certain Windows-specific characters are used,
> since the characters which make up 99.99% of your XML documents will
> be the lower (ASCII) characters, which have the same encodings under
> both ANSI and UTF-8. That is, A-Z, 0-9 and the rest of the 7-bit
> ASCII range have the same byte encoding when encoded using ANSI as
> they do when encoded using UTF-8.
> 
> However, the "en dash" (ANSI byte encoding 0x96) is an example of an
> ANSI character which maps to a byte that cannot possibly occur (on its
> own) in a legal UTF-8 encoded document. If an XML document, then, is
> generated using the ANSI byte encoding, while the document itself
> states that the encoding is UTF-8, the Unmarshaller (well, the parser,
> actually) will choke on the bad byte (0x96) because the
> Unmarshaller follows the XML document's instructions, which tell it to
> decode assuming the byte encoding is UTF-8.
> 
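To make the byte-level point concrete, here is a minimal sketch using only the JDK (no Castor involved). The class name EncodingMismatchDemo and the helper isValidUtf8 are made up for illustration. It assumes the JVM ships the "windows-1252" charset, which standard JDKs do.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Castor.
class EncodingMismatchDemo {

    // Returns true if the bytes form a well-formed UTF-8 sequence.
    // A strict decoder is used so malformed input is reported
    // instead of being silently replaced.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        // ASCII text produces identical bytes under both encodings.
        byte[] asciiAnsi = "A-Z 0-9".getBytes(cp1252);
        byte[] asciiUtf8 = "A-Z 0-9".getBytes(StandardCharsets.UTF_8);
        System.out.println(java.util.Arrays.equals(asciiAnsi, asciiUtf8));

        // The en dash (U+2013) is the single byte 0x96 in Cp1252.
        byte[] enDash = "\u2013".getBytes(cp1252);
        System.out.println(Integer.toHexString(enDash[0] & 0xFF));

        // That lone byte is not legal UTF-8, so a strict decoder rejects it.
        System.out.println(isValidUtf8(enDash));
    }
}
```

The first check shows why the bug hides for so long: plain ASCII content is byte-for-byte identical either way. Only the en dash, written as Cp1252, trips the UTF-8 decoder.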
> The solution is simple:
> 
> ** Always make sure your OutputStreamWriter is encoding characters
> using the same encoding as specified in the XML document.
> 
> That is, do something like this:
> 
> // NOTE: the Castor-generated XML document's encoding *must* match
> //       the encoding scheme the OutputStreamWriter uses to generate
> //       the XML document
> public static final String CHARSET = "UTF-8";
> ...
> 
> Marshaller marshaller =
>     new Marshaller(new OutputStreamWriter(xmlOutputStream, CHARSET));
> marshaller.setEncoding(CHARSET);
> 
> (With your chosen character set. UTF-8 is a good choice because all
> parsers should support this encoding.)
> 
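For anyone who wants to reproduce the failure and the fix without Castor on the classpath, the following is a rough sketch using the JDK's own OutputStreamWriter and built-in XML parser. The class MatchingEncodingDemo and its two helpers are hypothetical names. The hand-written XML declaration stands in for what Marshaller.setEncoding would put in the document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical demo class, not part of Castor.
class MatchingEncodingDemo {

    // Writes a tiny XML document whose declaration names `declared`
    // while the bytes are actually produced with `actual`.
    static byte[] writeXml(String text, Charset declared, Charset actual)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, actual)) {
            w.write("<?xml version=\"1.0\" encoding=\""
                    + declared.name() + "\"?>");
            w.write("<note>" + text + "</note>");
        }
        return out.toByteArray();
    }

    // Parses the bytes with the JDK's parser and returns the root
    // element's text, or null if the parser rejects the document.
    static String parse(byte[] xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        Charset utf8 = StandardCharsets.UTF_8;
        Charset cp1252 = Charset.forName("windows-1252");
        String text = "a\u2013b"; // contains an en dash

        // Writer encoding matches the declaration: parses cleanly.
        System.out.println(parse(writeXml(text, utf8, utf8)) != null);

        // Declaration says UTF-8 but bytes are Cp1252: parser rejects it.
        System.out.println(parse(writeXml(text, utf8, cp1252)) != null);
    }
}
```

The mismatched case is what you get when the writer is created without an explicit charset on a JVM whose default is Cp1252. Passing the same charset to both the writer and the declaration avoids it.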
> Anyhow, I hope this helps someone.
> 
> - James
