RE: XML decoding question

Jesse Pelton Wed, 24 Aug 2005 10:32:39 -0700

I'm not sure I understand the question. "windows-1252 XMLCh*" is a contradiction in terms: windows-1252 is an 8-bit character encoding, while XMLCh is a 16-bit type that holds UTF-16 code units. But I'll try to answer what I think you may be driving at.
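To make the size mismatch concrete, here is a minimal standalone sketch. It is not Xerces code: `Utf16Unit` is a hypothetical stand-in for XMLCh (whose actual underlying type depends on platform and build), and `naiveWiden` is an illustrative name. It uses the euro sign, which is byte 0x80 in windows-1252 but U+20AC in UTF-16.

```cpp
#include <cassert>
#include <climits>

// Hypothetical stand-in for Xerces' XMLCh: one 16-bit UTF-16 code unit.
using Utf16Unit = char16_t;

// The euro sign in each encoding.
constexpr unsigned char kEuroCp1252 = 0x80;   // one 8-bit byte in windows-1252
constexpr Utf16Unit     kEuroUtf16  = 0x20AC; // one 16-bit code unit in UTF-16

// A single byte cannot hold the UTF-16 value, so the two are not interchangeable.
static_assert(sizeof(Utf16Unit) * CHAR_BIT >= 16, "UTF-16 code units need 16 bits");
static_assert(kEuroUtf16 > 0xFF, "U+20AC does not fit in a single byte");

// Widening the byte without consulting the windows-1252 table yields U+0080
// (a control character), not the euro sign.
constexpr Utf16Unit naiveWiden(unsigned char b) {
    return static_cast<Utf16Unit>(b);
}
```

So a buffer of XMLCh filled by simply widening windows-1252 bytes is neither valid windows-1252 nor correct UTF-16 for characters in the 0x80 to 0x9F range.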

If you have a valid XML document encoded in windows-1252 and your system understands this encoding (which Windows systems do), Xerces can parse the document. You can then extract strings of XMLCh characters, encoded in UTF-16, from the DOM. If you want to display a string, you can use XMLString::transcode(), which transcodes to the current local code page. (I just noticed that the variant that transcodes to char * is deprecated, though. You're better off creating a local code page transcoder with XMLPlatformUtils::fgTransService->makeNewLCPTranscoder().) You've said that this produces "garbage." It would help if you provided a sample of the input and the output, along with your current code page. I'm not sure what happens if the input includes a character that is not representable in the output encoding.
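As a rough illustration of what transcoding UTF-16 to a local code page involves, here is a standalone sketch that targets windows-1252. It does not use Xerces, and `utf16ToCp1252` is a hypothetical helper. Substituting '?' for unrepresentable characters is one possible policy chosen for the example; it is an assumption, not a statement of what Xerces does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Unicode code points for windows-1252 bytes 0x80..0x9F; 0 marks an unassigned byte.
static const char16_t kCp1252High[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

// Hypothetical helper: transcode UTF-16 to windows-1252, substituting
// `replacement` for anything windows-1252 cannot represent.
std::string utf16ToCp1252(const std::u16string& in, char replacement = '?') {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t u = in[i];
        if (u < 0x80 || (u >= 0xA0 && u <= 0xFF)) {
            // ASCII and the upper Latin-1 range map to the same byte value.
            out.push_back(static_cast<char>(static_cast<unsigned char>(u)));
            continue;
        }
        // Search the 0x80..0x9F block for a matching code point.
        char mapped = replacement;
        for (int k = 0; k < 32; ++k) {
            if (kCp1252High[k] == u) {
                mapped = static_cast<char>(static_cast<unsigned char>(0x80 + k));
                break;
            }
        }
        out.push_back(mapped);
        // A surrogate pair is one character: emit one replacement, skip the low half.
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            ++i;
        }
    }
    return out;
}
```

With this policy, "café" comes out intact, while a character such as U+4E2D becomes a single '?'. If the displayed text looks like garbage rather than question marks, the more likely cause is that the bytes were decoded with the wrong encoding earlier on.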

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 24, 2005 12:49 PM
To: [email protected]
Subject: RE: XML decoding question

Is it possible to convert a windows-1252 XMLCh* to char*, so that a Windows application can display it?

-----Original Message-----
From: Jesse Pelton [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 24, 2005 12:34 PM
To: [email protected]
Subject: RE: XML decoding question

I strongly recommend getting the producer of your documents to produce valid XML. If they're encoded in windows-1252, the encoding declaration should reflect that. Then you can just pass the document to the parser and everything will be happy.

I suspect you're on your way to creating an elaborate (and probably delicate) workaround for a fundamental flaw outside your code. It's not surprising that you're getting confused; you're trying to outwit a carefully designed system, and I have the impression that you're not completely comfortable with character encodings, transcoding among them, and why Xerces operates as it does. All of that ceases to be an issue if your input documents are valid. Again, these documents would be valid if they correctly declared their encoding ("WINDOWS-1252" rather than "UTF-8" in the example you sent). Alternatively, whoever produces the documents could transcode all content into UTF-8 before adding it to the documents. That's probably the most robust solution.
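To see why a windows-1252 document that declares itself as UTF-8 is rejected, here is a standalone sketch of a UTF-8 well-formedness check. `isWellFormedUtf8` is a hypothetical helper written for this illustration; it is not how Xerces implements its decoder. The point is that a windows-1252 byte such as 0xE9 ("é") looks like the start of a multi-byte UTF-8 sequence and is almost never followed by valid continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8. It rejects
// stray continuation bytes, truncated sequences, overlong forms, surrogates,
// and values above U+10FFFF.
bool isWellFormedUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                      // continuation or invalid lead byte
        if (i + len > s.size()) return false;   // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        // Smallest code point that legitimately needs `len` bytes.
        static const unsigned long kMin[5] = {0, 0, 0x80, 0x800, 0x10000};
        if (len > 1 && cp < kMin[len]) return false;     // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}
```

A check like this can help the producer of the documents confirm the mismatch, but it is a diagnostic only. The fix is still to declare the real encoding or to emit genuine UTF-8.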
