RE: XML decoding question

Tony Dodd Wed, 24 Aug 2005 10:31:02 -0700

There is no such thing as a windows-1252 XMLCh. XMLCh values are 16 bits wide and they have just one encoding, Unicode (UTF-16). It is only with char strings that you need to specify an encoding. Irrespective of the encoding of your document, startElement will give you a string of 16-bit values.

Now, assuming you make the right XML declaration, some of these characters will be in the range 0-255 and can therefore be cast to char, and some will not. For example, the dash character that is hex 96 in your input XML file has Unicode code point U+2013 and so cannot be safely cast to char. This doesn't matter if you just want to manipulate the strings in some way and output them in a new XML file, but your problem appears to be how to display them.
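To make the point about hex 96 concrete, here is a minimal sketch of what decoding windows-1252 involves. It uses only the standard library, not Xerces: the names decodeCp1252 and fitsInChar are made up for illustration, and char16_t stands in for XMLCh. In practice the parser does this conversion for you once the declaration is right.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Unicode code points for windows-1252 bytes 0x80..0x9F. The five bytes the
// code page leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are passed through
// as the matching C1 control; that is a choice made for this sketch.
static const std::array<char16_t, 32> kCp1252High = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};

// Hypothetical helper: decode `count` windows-1252 bytes into 16-bit units.
std::u16string decodeCp1252(const unsigned char* bytes, std::size_t count) {
    assert(bytes != nullptr || count == 0);
    std::u16string out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const unsigned char b = bytes[i];
        if (b >= 0x80 && b <= 0x9F)
            out.push_back(kCp1252High[b - 0x80]);     // remapped range
        else
            out.push_back(static_cast<char16_t>(b));  // same value as Latin-1
    }
    return out;
}

// True if a 16-bit code unit survives a plain cast to an 8-bit char.
bool fitsInChar(char16_t c) { return c <= 0xFF; }
```

Decoding the byte 0x96 this way yields U+2013, for which fitsInChar returns false; that is the case where a cast to char loses information.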

Assuming you are on an NT-based Windows system such as XP, the simple answer is to use the Windows Unicode API to output them directly. All Windows API functions that handle text come in two flavours, ANSI (8-bit) and Unicode, and the Windows header files will use the Unicode versions provided you define the symbol UNICODE (the C runtime headers use _UNICODE). If you are using Visual Studio, just go to project properties and select "Use Unicode Character Set". You will then find that you can pass the XMLCh strings you get from Xerces directly to ExtTextOut. Or, if you need portability, you can use the ICU layout engine.

If for some reason this is not an option there are other possibilities:

(i) There is a thing called the Microsoft Layer for Unicode (MSLU) that you can ship with your application to make Unicode display functions available on platforms such as Win 98 where they are not native;

(ii) - but this in my view is a last resort - you can find an 8-bit font that contains all the characters you want and transcode back to its encoding. For example, there are plenty of fonts that support windows-1252, and so you:

* ask the transcoding service for a transcoder for windows-1252

* use the transcoder (transcodeTo) to turn your strings of XMLCh into char strings

* select your chosen font into the drawing context

* use the 8-bit API to output the transcoded string.
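The transcoding step in this list can be pictured roughly as follows. This is a standard-library stand-in, not the Xerces transcoder: encodeCp1252 is a hypothetical name, char16_t stands in for XMLCh, and substituting '?' for unrepresentable characters is an assumption of the sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// windows-1252 bytes 0x80..0x9F as Unicode code points; 0 marks the five
// bytes that the code page leaves undefined.
static const std::array<char16_t, 32> kCp1252High = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178};

// Hypothetical stand-in for a "transcode to" step: converts a zero-terminated
// string of 16-bit code units to windows-1252 bytes, writing `replacement`
// for anything the code page cannot represent.
std::string encodeCp1252(const char16_t* src, char replacement = '?') {
    assert(src != nullptr);
    std::string out;
    for (; *src != 0; ++src) {
        const char16_t c = *src;
        if (c < 0x80 || (c >= 0xA0 && c <= 0xFF)) {
            out.push_back(static_cast<char>(c));  // identical to Latin-1
            continue;
        }
        char byte = replacement;
        for (std::size_t i = 0; i < kCp1252High.size(); ++i) {
            if (kCp1252High[i] == c) {            // found in the remapped range
                byte = static_cast<char>(0x80 + i);
                break;
            }
        }
        out.push_back(byte);
    }
    return out;
}
```

With this sketch U+2013 comes back as the single byte hex 96, which an 8-bit font covering windows-1252 can then display.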

Tony Dodd

OU Xaira Project

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: 24 August 2005 17:49
To: [email protected]
Subject: RE: XML decoding question

Is it possible to convert a windows-1252 XMLCh* to char*, so that windows application could display it?

-----Original Message-----
From: Jesse Pelton [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 24, 2005 12:34 PM
To: [email protected]
Subject: RE: XML decoding question

I strongly recommend getting the producer of your documents to produce valid XML. If they're encoded in windows-1252, the encoding declaration should reflect that. Then you can just pass the document to the parser and everything will be happy.

I suspect you're on your way to creating an elaborate (and probably delicate) workaround for a fundamental flaw outside your code. It's not surprising that you're getting confused; you're trying to outwit a carefully designed system, and I have the impression that you're not completely comfortable with character encodings, transcoding among them, and why Xerces operates as it does. All of that ceases to be an issue if your input documents are valid. Again, these documents would be valid if they correctly declared their encoding ("WINDOWS-1252" rather than "UTF-8" in the example you sent). Alternatively, whoever produces the documents could transcode all content into UTF-8 before adding it to the documents. That's probably the most robust solution.

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 24, 2005 11:51 AM
To: [email protected]
Subject: RE: XML decoding question

Sorry, I am getting really confused.

So, I have XML that is in windows-1252. When parsing this document, Xerces calls the startElement() method, which takes an XMLCh* as a parameter.

I need to be able to convert this to a char*. Since the XMLString::transcode() function is returning garbage, I need to create my own transcoder. What do I need to transcode it to?

If I am to call the transcodeTo or transcodeFrom functions, I need to know the number of XMLChs in the input parameter. How can I know the length of XMLCh *?
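On the length question: Xerces strings are zero-terminated, so the length is the number of 16-bit units before the terminating zero, which is what XMLString::stringLen() reports. A minimal sketch of that counting, assuming char16_t as a stand-in for XMLCh and with codeUnitLength as a made-up name:

```cpp
#include <cassert>
#include <cstddef>

// Counts 16-bit code units up to, but not including, the terminating zero.
// A character outside the Basic Multilingual Plane occupies two units
// (a surrogate pair) and is therefore counted twice.
std::size_t codeUnitLength(const char16_t* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}
```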

-----Original Message-----
From: Jesse Pelton [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 24, 2005 10:37 AM
To: [email protected]
Subject: RE: XML decoding question

XMLString::transcode() always transcodes to the native code page (which may vary from machine to machine and user to user). If that's good enough for your purposes, by all means use it.

To transcode to a particular encoding, use XMLTransService::makeNewTranscoderFor() to create a transcoder for the encoding, then call the transcoder's transcodeTo() member, passing in a pointer to a buffer for the output.

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 24, 2005 10:28 AM
To: [email protected]
Subject: RE: XML decoding question

Doesn't it do the same kind of transformation as XMLString::transcode()?

Could you give me an example of how to use it?

-----Original Message-----
From: Jesse Pelton [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 24, 2005 10:24 AM
To: [email protected]
Subject: RE: XML decoding question

Use XMLTranscoder::transcodeTo(), assuming that all characters in the target encoding fit into a char.

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 24, 2005 10:12 AM
To: [email protected]
Subject: RE: XML decoding question

OK, also, a simple question: is there a function that will convert XMLCh* to char*?

-----Original Message-----
From: Jesse Pelton [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 24, 2005 10:08 AM
To: [email protected]
Subject: RE: XML decoding question

You should be able to just parse the document after changing the encoding. Don't transcode it; if you do, the document will be using Xerces's internal encoding (UTF-16) but claiming to be windows-1252.

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 24, 2005 10:05 AM
To: [email protected]
Subject: RE: XML decoding question

Yes, I did that. I changed the encoding to windows-1252, but the transcode() function returned garbage.

-----Original Message-----
From: Tony Dodd [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 24, 2005 10:03 AM
To: [email protected]
Subject: RE: XML decoding question

That's not surprising since Xerces has no way of knowing the encoding other than reading the XML declaration, and the XML declaration is wrong. If you correct the XML declaration so that it says

encoding="windows-1252"

you should then be able to parse the document without more ado - Xerces will automatically convert to its internal 16-bit Unicode without any need for explicit transcoding on your part. Or am I misunderstanding what you are trying to do?
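If the producer cannot be persuaded to fix the declaration, it could in principle be patched in the raw bytes before parsing. The following is only a sketch under stated assumptions: withDeclaredEncoding is a hypothetical helper, it handles only ASCII-compatible encodings, and it does no real XML parsing. Calling setEncoding() on the InputSource, mentioned earlier in the thread, is the alternative that avoids touching the bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns `doc` with the encoding pseudo-attribute of its
// XML declaration set to `encoding`. A document without a declaration, or
// without an encoding in it, is returned unchanged.
std::string withDeclaredEncoding(const std::string& doc,
                                 const std::string& encoding) {
    assert(!encoding.empty());
    const std::size_t npos = std::string::npos;
    if (doc.compare(0, 5, "<?xml") != 0) return doc;   // no declaration
    const std::size_t declEnd = doc.find("?>");
    if (declEnd == npos) return doc;
    const std::size_t key = doc.find("encoding", 5);
    if (key == npos || key > declEnd) return doc;      // no encoding given
    const std::size_t open = doc.find_first_of("\"'", key);
    if (open == npos || open > declEnd) return doc;
    const std::size_t close = doc.find(doc[open], open + 1);
    if (close == npos || close > declEnd) return doc;
    std::string out = doc;
    out.replace(open + 1, close - open - 1, encoding); // swap the value only
    return out;
}
```

Only the declared name changes; the document's bytes are left exactly as they were, which is the point made elsewhere in the thread about not transcoding the document itself.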

Tony Dodd

OU Xaira Project

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: 24 August 2005 14:54
To: [email protected]
Subject: RE: XML decoding question

I am on Windows, but trying to transcode the XML using windows-1252 gives me garbage characters everywhere, starting from the first special character.

-----Original Message-----
From: Tony Dodd [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 24, 2005 9:52 AM
To: [email protected]
Subject: RE: XML decoding question

The text you posted is encoded in the code page whose IANA name is windows-1252 - see http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx. If you are on Windows you should be able to transcode to this; on Linux you will need to build Xerces with the ICU transcoders.

Tony Dodd

OU Xaira Project

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: 24 August 2005 14:00
To: [EMAIL PROTECTED]; [email protected]
Subject: RE: XML decoding question

I am still struggling with the same problem. The XML we are getting says it's encoded in UTF-8, but it's not. It contains "special" word characters (', ", and -) which are outside the ASCII range and are supposed to be part of extended ASCII. I've tried specifying different encodings, but the best result was that the XMLString::transcode() function replaced those characters with "?".

The extended ASCII values for these characters are, in hex: 92 ('), 93 ("), 94 ("), 96 (-). I need to be able to get the same values back from the transcode() function.
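A quick way to confirm that such a document is not really UTF-8 is to check whether its bytes are well-formed UTF-8 at all; a lone byte in the range hex 80 to BF, such as 92 or 96, never is. The checker below is a minimal standard-library sketch, and isValidUtf8 is a name made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Returns true if the `count` bytes at `p` form well-formed UTF-8.
bool isValidUtf8(const unsigned char* p, std::size_t count) {
    assert(p != nullptr || count == 0);
    std::size_t i = 0;
    while (i < count) {
        const unsigned char b = p[i];
        std::size_t extra = 0;   // number of continuation bytes expected
        unsigned long cp = 0;    // code point being assembled
        if (b < 0x80) { cp = b; }
        else if (b >= 0xC2 && b <= 0xDF) { extra = 1; cp = b & 0x1F; }
        else if (b >= 0xE0 && b <= 0xEF) { extra = 2; cp = b & 0x0F; }
        else if (b >= 0xF0 && b <= 0xF4) { extra = 3; cp = b & 0x07; }
        else return false;       // stray continuation byte or invalid lead
        if (count - i <= extra) return false;          // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) return false;      // not a continuation
            cp = (cp << 6) | (c & 0x3F);
        }
        if (extra == 2 && cp < 0x800) return false;    // overlong form
        if (extra == 3 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        i += extra + 1;
    }
    return true;
}
```

The byte sequence 'a', hex 96, 'b' fails this check, while the genuine UTF-8 form of the same dash (hex E2 80 93) passes.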

-----Original Message-----
From: Jesse Pelton [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 23, 2005 11:02 AM
To: [email protected]
Subject: RE: XML decoding question

You may not need to do anything special. Will your text contain characters that are not US-ASCII? If not, you can just pass the text to the parser and it'll be happy (as long as there's no incompatible encoding declaration in the document), because US-ASCII is a subset of UTF-8. If your text has characters outside the US-ASCII range, they are encoded somehow. If you know the encoding, you can use a transcoder to decode to UTF-8. (See the documentation for XMLTransService and XMLTranscoder and its derived classes.) This is risky business, though: who- or whatever produces the document should specify the actual encoding used in an encoding declaration. Otherwise, you have to assume you know the encoding, and it's possible that you'll get it wrong (especially in the future, when people use your code for purposes you never envisioned). If you transcode a document that correctly declares itself to be, say UTF-16, to UTF-8 before giving it to the parser, the parser will choke when it discovers that the actual encoding does not match the declared encoding (unless you alter the encoding declaration).
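The first question above, whether the text contains anything outside US-ASCII, is easy to check mechanically. A minimal sketch, with isUsAscii as a hypothetical name:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// True if every byte is in the US-ASCII range 0x00..0x7F. Such text is also
// valid UTF-8, so it needs no transcoding before being handed to the parser.
bool isUsAscii(const unsigned char* p, std::size_t count) {
    assert(p != nullptr || count == 0);
    return std::all_of(p, p + count,
                       [](unsigned char b) { return b < 0x80; });
}
```

If this returns false, the remaining bytes are in some encoding that has to be known or declared; the check itself cannot tell which one.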

Again, I think a more specific statement of your problem would be useful. What are you trying to accomplish? Do you have code that is not behaving as you want it to? Can you reproduce any problem you're having using one of the sample applications and a document that you can attach to a message?

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 23, 2005 9:40 AM
To: [email protected]
Subject: RE: XML decoding question

Ok, if I have a text string with XML, is it possible to encode it in UTF-8 first before passing it on to the parser routine? How can that be done?

Thanks,

Marina

-----Original Message-----
From: Jesse Pelton [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 18, 2005 9:33 AM
To: [email protected]
Subject: RE: XML decoding question

I think the answer to your question is "no." In the world of XML, all documents are represented in some encoding. In fact, this is true of all text that is represented as a sequence of bits. People in the US have just gotten used to thinking of US-ASCII as being "plain text," but people whose native languages include characters that are not representable in US-ASCII see it differently.

In order to do anything useful, an XML processor has to represent the characters in a document in some way that it can understand (so it can find the characters that surround tags and attributes, etc). To comply with the DOM spec, a processor must encode DOMStrings as UTF-16. Since implementation is simpler (and therefore more reliable) if all characters are treated the same, it makes sense to represent all text (internally) as UTF-16.

So the bottom line is, to do any useful work, an XML processor MUST successfully transform the sequence of bits that make up a document from the document encoding to an encoding that the processor understands. In the case of Xerces, the target encoding is UTF-16.

The obvious next question is what you're trying to accomplish.

As for the signature line, perhaps if enough people point out to pointy-haired-bosses that you can't get your work done if people pay attention to it, and it makes the company look silly to boot, they'd get the message. Maybe it's tilting at windmills, but then again, maybe the squeaky wheel will get the grease.

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 18, 2005 9:11 AM
To: [email protected]
Subject: RE: XML decoding question

An XML document has an encoding declaration such as encoding="utf-8". It is possible to override this in code by calling the setEncoding() method on the InputSource given to the parser. Is it possible not to do any decoding and just treat the file as plain text?

Sorry, I can't do anything about the text that is automatically added to the email... Corporate policy of Verizon Wireless.
