RE: XML decoding question

Jesse Pelton Wed, 24 Aug 2005 06:56:36 -0700

Aha. So you want to somehow fix up invalid documents. The only solution that is really correct (and therefore reliable) is to require valid documents. Who- or whatever generates the documents must use the declared encoding or the documents are not valid XML. If they're not valid, they can't reasonably be expected to be understood by an XML processor. You might as well use some other proprietary format, because what you're using isn't XML.

In this case, so-called "extended ASCII" appears to mean "characters whose values range from 0x80 - 0xFF in the Windows-1252 code page," which is consistent with your August 9 message. If you know for certain that the characters in the document all come from Windows-1252, changing the declared encoding to "WINDOWS-1252" should do the trick. This is the responsibility of the document producer, but if you like living dangerously, you could change the encoding declaration in the document before parsing it, or just assume it's Windows-1252 masquerading as UTF-8 and use a Windows-1252 transcoder to transcode to UTF-16, then transcode the result to UTF-8 (assuming that's what the encoding declaration is). But really, get the document producer to create valid documents.
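For illustration, the two-step transcode described above (Windows-1252 to UTF-16, then UTF-16 to UTF-8) can be sketched with the standard library alone. This is a minimal sketch and not the Xerces transcoder API: the names cp1252ToUtf16, utf16ToUtf8, and kCp1252High are made up for this example, and mapping the five bytes that Windows-1252 leaves undefined to U+FFFD is an assumption (other converters handle them differently).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Unicode code points for Windows-1252 bytes 0x80..0x9F.
// 0xFFFD marks the five bytes that Windows-1252 leaves undefined
// (an assumption of this sketch; other converters differ).
static const char16_t kCp1252High[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

// Step 1: decode Windows-1252 bytes to UTF-16 code units.
std::u16string cp1252ToUtf16(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (char c : bytes) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b >= 0x80 && b <= 0x9F) {
            out.push_back(kCp1252High[b - 0x80]);
        } else {
            // 0x00..0x7F and 0xA0..0xFF have the same value in Unicode.
            out.push_back(static_cast<char16_t>(b));
        }
    }
    return out;
}

// Step 2: encode UTF-16 code units as UTF-8 bytes.
std::string utf16ToUtf8(const std::u16string& text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        std::uint32_t cp = text[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size() &&
            text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (text[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Calling utf16ToUtf8(cp1252ToUtf16(rawBytes)) on the raw document bytes would give text that actually matches a UTF-8 encoding declaration. It still rests on the assumption that every byte really is Windows-1252.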

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 24, 2005 9:00 AM
To: Jesse Pelton; [email protected]
Subject: RE: XML decoding question

I am still struggling with the same problem. The XML we are getting says it's encoded in UTF-8, but it's not. It contains "special" word characters (', ", and -) that are outside the ASCII range and are supposed to be part of extended ASCII. I've tried specifying different encodings, but the best result was that the XMLString::transcode() function replaced those characters with "?".

The extended ASCII values for these characters are, in hex: 92 ('), 93 ("), 94 ("), and 96 (-). I need to be able to get the same values back from the transcode() function.
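To make the goal concrete: getting those byte values back means encoding the parser's UTF-16 text as Windows-1252 rather than as the local code page. The sketch below is hypothetical and is not how XMLString::transcode() works internally. The function name utf16ToCp1252Subset is invented, and it covers only the four characters listed above (assumed to be U+2019, U+201C, U+201D, and U+2013) plus the ranges that have the same value in Windows-1252 and Unicode. Anything else becomes the replacement character, which mimics the "?" result.

```cpp
#include <string>

// Hypothetical helper: encode UTF-16 text as Windows-1252 bytes.
// Only the four punctuation characters from this message are mapped
// explicitly; a real encoder would need the full 0x80..0x9F table.
std::string utf16ToCp1252Subset(const std::u16string& text,
                                char replacement = '?') {
    std::string out;
    out.reserve(text.size());
    for (char16_t ch : text) {
        if (ch < 0x80 || (ch >= 0xA0 && ch <= 0xFF)) {
            // Same value in Windows-1252 and in Unicode.
            out.push_back(static_cast<char>(static_cast<unsigned char>(ch)));
            continue;
        }
        switch (ch) {
            case 0x2019: out.push_back('\x92'); break;  // right single quote
            case 0x201C: out.push_back('\x93'); break;  // left double quote
            case 0x201D: out.push_back('\x94'); break;  // right double quote
            case 0x2013: out.push_back('\x96'); break;  // en dash
            default:     out.push_back(replacement);    // not representable here
        }
    }
    return out;
}
```

This only helps once the document has been decoded correctly in the first place. If the parser was told the bytes are UTF-8 when they are not, the original characters are already lost by the time any transcoding back takes place.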

-----Original Message-----
From: Jesse Pelton [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 23, 2005 11:02 AM
To: [email protected]
Subject: RE: XML decoding question

You may not need to do anything special. Will your text contain characters that are not US-ASCII? If not, you can just pass the text to the parser and it'll be happy (as long as there's no incompatible encoding declaration in the document), because US-ASCII is a subset of UTF-8. If your text has characters outside the US-ASCII range, they are encoded somehow. If you know the encoding, you can use a transcoder to decode to UTF-8. (See the documentation for XMLTransService and XMLTranscoder and its derived classes.) This is risky business, though: who- or whatever produces the document should specify the actual encoding used in an encoding declaration. Otherwise, you have to assume you know the encoding, and it's possible that you'll get it wrong (especially in the future, when people use your code for purposes you never envisioned). If you transcode a document that correctly declares itself to be, say, UTF-16 to UTF-8 before giving it to the parser, the parser will choke when it discovers that the actual encoding does not match the declared encoding (unless you alter the encoding declaration).
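The first question above (is the text pure US-ASCII?) is cheap to answer before involving any transcoder. A minimal sketch, with the hypothetical name isUsAscii:

```cpp
#include <string>

// Returns true when every byte is in the US-ASCII range 0x00..0x7F.
// Such text is byte-for-byte identical when interpreted as UTF-8.
bool isUsAscii(const std::string& bytes) {
    for (char c : bytes) {
        if (static_cast<unsigned char>(c) >= 0x80) {
            return false;
        }
    }
    return true;
}
```

If this returns false, the bytes at or above 0x80 are in some encoding that this check cannot identify. That still has to come from the document producer.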

Again, I think a more specific statement of your problem would be useful. What are you trying to accomplish? Do you have code that is not behaving as you want it to? Can you reproduce any problem you're having using one of the sample applications and a document that you can attach to a message?

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 23, 2005 9:40 AM
To: [email protected]
Subject: RE: XML decoding question

Ok, if I have a text string with XML, is it possible to encode it in UTF-8 first before passing it on to the parser routine? How can that be done?

Thanks,

Marina

-----Original Message-----
From: Jesse Pelton [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 18, 2005 9:33 AM
To: [email protected]
Subject: RE: XML decoding question

I think the answer to your question is "no." In the world of XML, all documents are represented in some encoding. In fact, this is true of all text that is represented as a sequence of bits. People in the US have just gotten used to thinking of US-ASCII as being "plain text," but people whose native languages include characters that are not representable in US-ASCII see it differently.

In order to do anything useful, an XML processor has to represent the characters in a document in some way that it can understand (so it can find the characters that surround tags and attributes, etc). To comply with the DOM spec, a processor must encode DOMStrings as UTF-16. Since implementation is simpler (and therefore more reliable) if all characters are treated the same, it makes sense to represent all text (internally) as UTF-16.

So the bottom line is, to do any useful work, an XML processor MUST successfully transform the sequence of bits that make up a document from the document encoding to an encoding that the processor understands. In the case of Xerces, the target encoding is UTF-16.
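As a rough picture of what that transformation involves for one encoding, here is a standard-library sketch that decodes UTF-8 bytes into UTF-16 code units. It is not Xerces code: utf8ToUtf16 is a hypothetical name, and throwing std::invalid_argument on malformed input is a choice made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes into UTF-16 code units, rejecting malformed input.
std::u16string utf8ToUtf16(const std::string& bytes) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const std::uint32_t kMinForLen[5] = {0, 0, 0x80, 0x800, 0x10000};
    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > bytes.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (cont & 0x3F);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (cp < kMinForLen[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::invalid_argument("invalid UTF-8 code point");
        }
        if (cp >= 0x10000) {
            // Encode as a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
        i += len;
    }
    return out;
}
```

A decoder for a different document encoding would have the same shape with different byte-to-code-point rules. That is why the processor needs to know which encoding the bytes are really in.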

The obvious next question is what you're trying to accomplish.

As for the signature line, perhaps if enough people point out to pointy-haired-bosses that you can't get your work done if people pay attention to it, and it makes the company look silly to boot, they'd get the message. Maybe it's tilting at windmills, but then again, maybe the squeaky wheel will get the grease.

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 18, 2005 9:11 AM
To: [email protected]
Subject: RE: XML decoding question

An XML document has an encoding clause such as encoding="utf-8". It is possible to override this in code by calling the setEncoding() method on the InputSource given to the parser. Is it possible to skip decoding altogether and just treat the file as plain text?

Sorry, I can't do anything about the text that is automatically added to the email... Corporate policy of Verizon Wireless.
___________________________________________________________________
The information contained in this message and any attachment may be
proprietary, confidential, and privileged or subject to the work
product doctrine and thus protected from disclosure.  If the reader
of this message is not the intended recipient, or an employee or
agent responsible for delivering this message to the intended
recipient, you are hereby notified that any dissemination,
distribution or copying of this communication is strictly prohibited.
If you have received this communication in error, please notify me
immediately by replying to this message and deleting it and all
copies and backups thereof.  Thank you.
