RE: Encoding issues

Jesse Pelton Fri, 19 Aug 2005 05:39:32 -0700

A call like the following before you parse should do the trick:

mbis->setEncoding(L"UTF-8");

But think twice before doing this. MSXML is within its rights to omit the encoding as long as the document is UTF-8 or UTF-16. (See http://www.w3.org/TR/2004/REC-xml-20040204/#charencoding.) You might want to leave your code as it is and let the parser determine the encoding by inspecting the first few bytes. (See http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing.) If you force it to be UTF-8, you'll be in trouble if you get a document in another encoding. Leaving it to the parser allows you to handle UTF-8 and both flavors of UTF-16, as well as any other encodings you have a transcoder for.
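
To make the "first few bytes" idea concrete, here is a rough sketch of the kind of detection the second link describes. This is not Xerces's actual code; the function name guessXmlEncoding is made up for illustration, and it only covers UTF-8, UTF-16 and the UTF-32 byte order marks, ignoring EBCDIC and the BOM-less UCS-4 cases.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Xerces): guess the encoding family of an
// XML document from its leading bytes, loosely following Appendix F of the
// XML 1.0 recommendation.
std::string guessXmlEncoding(const unsigned char* data, std::size_t len)
{
    // 4-byte byte order marks must be checked before the 2-byte ones,
    // because FF FE 00 00 starts with the UTF-16LE mark FF FE.
    if (len >= 4) {
        if (data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF)
            return "UTF-32BE";
        if (data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
            return "UTF-32LE";
    }
    if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16BE";
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16LE";

    // No byte order mark: look at how "<?" is laid out in memory.
    if (len >= 4) {
        if (data[0] == 0x00 && data[1] == 0x3C && data[2] == 0x00 && data[3] == 0x3F)
            return "UTF-16BE";
        if (data[0] == 0x3C && data[1] == 0x00 && data[2] == 0x3F && data[3] == 0x00)
            return "UTF-16LE";
    }

    // Anything else is treated as UTF-8, the default when there is no
    // byte order mark and no encoding declaration.
    return "UTF-8";
}
```

With a buffer that starts with <?xml version="1.0"?> in plain 8-bit characters, this returns "UTF-8"; the same text saved as little-endian UTF-16 returns "UTF-16LE". A real parser goes further and then reads the encoding declaration, if any, to refine the guess.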

From: Milan Tomic [mailto:[EMAIL PROTECTED]]
Sent: Friday, August 19, 2005 4:53 AM
To: [email protected]
Subject: Encoding issues

I'm parsing my XML like this using Xerces 2.5.0:

MemBufInputSource *mbis = new MemBufInputSource((const unsigned char *const)xml, strlen(xml), L"...");
parser->parse(*mbis);

The problem is that my XML contains no encoding declaration:

<?xml version="1.0"?>

Originally my file was like this:

<?xml version="1.0" encoding="UTF-8"?>

but the encoding info gets lost when I use the MSXML parser in JScript, because of the UTF-8 -> UTF-16 conversion...

Is there a way to tell Xerces which encoding was used for the XML? Something like this:

MemBufInputSource *mbis = new MemBufInputSource((const unsigned char *const)xml, strlen(xml), L"...");
parser->setEncoding(L"UTF-8");
parser->parse(*mbis);

Thank you in advance,
Milan
