From: [EMAIL PROTECTED]
Sent: Tuesday, August 09, 2005 10:19 AM
To: [email protected]
Subject: problem with XML encoding

We are using the Xerces SAX parser to parse incoming XML. In some cases the XML contains characters that were copied and pasted from an MS Word document. It seems that the character set should be "windows-1252" in this case.
If such a document is parsed with "utf-8" encoding, Internet Explorer and our application give the same error message: an invalid character was encountered. When the document is parsed with "windows-1252", IE is able to display it properly, but our application is not. The character set in our application is set to 1252.
Why are we not able to display the characters properly? Does anybody know the solution to this?
Attached is the sample XML file, and a word document with screen shots of the problem in our application.
Thanks,
Marina
908 607 8580
When I edited the document to change the encoding from
UTF-8 to WINDOWS-1252, both DOMPrint and SAX2Print were able to process the
file. If you run the same experiment and get the same results, this indicates a
problem with your application rather than with Xerces.
If your application is overriding the document's
declared encoding, note that this is risky business. Documents should correctly
declare their encoding. When an application overrides the document encoding, it
presumes to know more about the document than the document's author. It's
sometimes necessary nonetheless, as the documentation
for InputSource::setEncoding() points out, but this case does not
seem to fit the pattern described there.
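
To illustrate why the UTF-8 declaration fails for text pasted from Word: in windows-1252 the curly quotes are single bytes such as 0x93 and 0x94, and bytes in that range can never stand alone in well-formed UTF-8. The following is a minimal sketch in standard C++, not Xerces code. The helper name isValidUtf8 is hypothetical, and it checks only the lead-byte and continuation-byte structure, not overlong forms or surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Xerces: returns true when `bytes` has the
// basic structure of UTF-8 (each lead byte followed by the right number of
// continuation bytes). Overlong and surrogate checks are omitted for brevity.
bool isValidUtf8(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;
        if (c < 0x80) {
            extra = 0;                    // 0xxxxxxx: plain ASCII
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1;                    // 110xxxxx: 2-byte sequence
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2;                    // 1110xxxx: 3-byte sequence
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3;                    // 11110xxx: 4-byte sequence
        } else {
            return false;                 // stray continuation or invalid lead byte
        }
        if (i + extra >= bytes.size()) {
            return false;                 // sequence is truncated
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(bytes[i + k]);
            if ((cc & 0xC0) != 0x80) {
                return false;             // expected 10xxxxxx
            }
        }
        i += extra + 1;
    }
    return true;
}
```

With this helper, isValidUtf8("\x93quoted\x94") is false (the windows-1252 curly quotes), while isValidUtf8("\xE2\x80\x9Cquoted\xE2\x80\x9D") is true (the same quotes encoded as UTF-8). That matches the behaviour described above: the file only parses once its declaration names the encoding its bytes are actually in.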

