From: Milan Tomic [mailto:[EMAIL PROTECTED]
Sent: Friday, August 19, 2005 4:53 AM
To: [email protected]
Subject: Encoding issues
I'm parsing my XML like this using Xerces 2.5.0:
MemBufInputSource *mbis = new MemBufInputSource((const unsigned char *const)xml, strlen(xml), L"...");
parser->parse(*mbis);
The problem is that in my XML there is no encoding information:
<?xml version="1.0"?>
Originally my file was like this:
<?xml version="1.0" encoding="UTF-8"?>
but the encoding info gets lost when I use the MSXML parser in JScript, because of the UTF-8 -> UTF-16 conversion...
Is there a way to tell Xerces which encoding was used for XML? Something like this:
MemBufInputSource *mbis = new MemBufInputSource((const unsigned char *const)xml, strlen(xml), L"...");
parser->setEncoding(L"UTF-8");
parser->parse(*mbis);
Thank you in advance,
Milan
Title: Encoding issues
A call like the following before you parse should do the trick:
mbis->setEncoding(L"UTF-8");
But think twice before doing this. MSXML is within its rights to omit the
encoding as long as the document is UTF-8 or UTF-16. (See
http://www.w3.org/TR/2004/REC-xml-20040204/#charencoding.)
You might want to leave your code as it is and let the parser determine the
encoding by inspecting the first few bytes. (See
http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing.)
If you force it to be UTF-8, you'll be in trouble if you get a document in
another encoding. Leaving it to the parser allows you to handle UTF-8 and both
flavors of UTF-16, as well as any other encodings you have a transcoder for.
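To give a rough idea of what "inspecting the first few bytes" means, here is a minimal sketch of that kind of check, loosely following the approach described in the #sec-guessing appendix linked above. It is not Xerces code: guessXmlEncoding is a hypothetical helper made up for illustration, it only covers byte order marks and the BOM-less UTF-16 "<?" patterns, and it ignores EBCDIC and BOM-less UCS-4. A real parser also reads the encoding declaration afterwards, which this sketch does not.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Xerces): guess the encoding family of an
// XML document from its first bytes. Simplified assumption: only byte order
// marks and the UTF-16 layouts of "<?" are recognized; anything else is
// treated as UTF-8, the default for a document without a declaration.
std::string guessXmlEncoding(const unsigned char* buf, std::size_t len)
{
    // 4-byte byte order marks first, because the UTF-32LE mark starts with
    // the same two bytes as the UTF-16LE mark.
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF)
        return "UTF-32BE";
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00)
        return "UTF-32LE";

    // Shorter byte order marks.
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";

    // No byte order mark: see how the leading "<?" is laid out in memory.
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x3C && buf[2] == 0x00 && buf[3] == 0x3F)
        return "UTF-16BE";
    if (len >= 4 && buf[0] == 0x3C && buf[1] == 0x00 && buf[2] == 0x3F && buf[3] == 0x00)
        return "UTF-16LE";

    // Nothing distinctive found: fall back to UTF-8.
    return "UTF-8";
}
```

One thing to keep in mind if you rely on detection: a byte count taken with strlen() only works for encodings without embedded zero bytes, so a UTF-16 buffer would need its length passed in some other way.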
- Encoding issues Milan Tomic
- RE: Encoding issues Jesse Pelton
