Hi Robert,
once the file has been parsed, all you see are 16-bit Unicode values
(UTF-16), which for English text (indeed, for any character in the range
U+0000 to U+00FF) have the same numeric values as Latin-1.
Usually you shouldn't care about seeing the original UTF-8 sequence, since
what matters is the actual character being represented; but if you
need it for a valid reason, you should instantiate the UTF8Transcoder and
tell it to transcode from Unicode to UTF-8.
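
To make the UTF-16 to UTF-8 step concrete, here is a minimal standalone sketch of what such a transcoding does to each 16-bit value. It deliberately does not call into Xerces: the helper name utf16ToUtf8 is hypothetical, and unsigned short stands in for XMLCh as an assumption. In real code you would go through the library's transcoder rather than hand-rolling this.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Xerces: encodes UTF-16 code units as UTF-8.
// "unsigned short" stands in for a 16-bit XMLCh (an assumption of this sketch).
std::string utf16ToUtf8(const unsigned short* src, std::size_t len)
{
    std::string out;
    for (std::size_t i = 0; i < len; ++i) {
        unsigned long cp = src[i];

        // A high surrogate followed by a low surrogate forms one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len
            && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            ++i;
        }
        // Note: an unpaired surrogate is not rejected here; it falls through
        // to the three-byte case. A real transcoder would report an error.

        if (cp < 0x80) {
            // ASCII: one byte, identical to the Latin-1 byte.
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            // Two bytes; this covers the upper half of Latin-1 (0x80-0xFF).
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            // Three bytes for the rest of the Basic Multilingual Plane.
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // Four bytes for code points above U+FFFF.
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

So a value such as 0x00E9 (e with acute accent), which charAt() reports as a single Latin-1-looking value, becomes the two bytes C3 A9 once encoded as UTF-8, while plain ASCII characters stay as single bytes.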
Alberto
At 12.53 07/06/2005 +0100, Robert Parker wrote:
Hi
I am parsing an XML string that is encoded in UTF-8 and I am using the
following code to view element attributes:
DOM_NamedNodeMap NodeMap = node.getAttributes();
if ( NodeMap != NULL) {
    unsigned int len = NodeMap.getLength();
    for ( unsigned int i = 0; i < len; ++i) {
        DOM_Node attr = NodeMap.item(i);
        // attribute name, transcoded to the local code page
        DOMString tag = attr.getNodeName();
        char *t = tag.transcode();
        printf (" %s=", t );
        delete [] t;
        // attribute value, transcoded to the local code page
        DOMString value = attr.getNodeValue();
        t = value.transcode();
        printf ("%s\n", t );
        delete [] t;
        // dump each 16-bit value of the attribute string
        for ( unsigned int j = 0; j < value.length() ; j++ )
        {
            printf( " AT %u %c %02x\n", j, value.charAt(j), value.charAt(j) );
        }
    }
}
Both the transcode'd value and the "raw" value.charAt() show my parsed
attribute value as Latin-1.
It seems to me that Xerces converts the UTF-8 encoded attribute values
during the parse.
How can I get Xerces to return the actual UTF-8 encoded data rather than
the Latin-1 representation?
(I am using Xerces 1.5.2. I know it's old, but I'm trying to avoid a
massive upgrade exercise if at all possible.)
thanks
Robert