Hi Robert,
Once the file has been parsed, all you see are 16-bit Unicode values (UTF-16), which, if you deal only with English text, look the same as Latin-1. Usually you shouldn't care about seeing the original UTF-8 sequence, since what matters is the actual character being represented; but if you have a valid reason to need it, you should instantiate the UTF8Transcoder and tell it to transcode from Unicode to UTF-8.
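
To illustrate what transcoding from UTF-16 back to UTF-8 involves, here is a minimal standalone sketch. It is not the Xerces transcoder and uses no Xerces calls; the function name utf16ToUtf8 and the use of unsigned short as a stand-in for the 16-bit character type are assumptions made for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Xerces): encodes an array of UTF-16
// code units as a UTF-8 byte string.
std::string utf16ToUtf8(const unsigned short* src, std::size_t len)
{
    std::string out;
    for (std::size_t i = 0; i < len; ++i) {
        unsigned long cp = src[i];

        // Combine a high/low surrogate pair into a single code point.
        // An unpaired surrogate falls through and is encoded as-is
        // (three bytes); a stricter encoder would reject it.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len
            && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            ++i;
        }

        if (cp < 0x80) {                       // 1 byte: plain ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For example, the single code unit 0x00E9 (e with acute accent), which prints as one Latin-1 byte E9, comes out as the two UTF-8 bytes C3 A9. Values collected one at a time with charAt() could be fed to a routine like this, though the library's own transcoder is the better choice when it is available.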

Alberto

At 12.53 07/06/2005 +0100, Robert Parker wrote:
Hi

I am parsing an XML string that is encoded in UTF-8 and I am using the
following code to view element attributes:

    DOM_NamedNodeMap NodeMap    = node.getAttributes();
    if ( NodeMap != NULL) {

        unsigned int len = NodeMap.getLength();
        for ( int i = 0; i < len; ++i) {
            DOM_Node attr = NodeMap.item(i);

            DOMString tag = attr.getNodeName();
            char *t = tag.transcode();
            printf ("    %s=", t );
            delete [] t;

            DOMString value     = attr.getNodeValue();
            t = value.transcode();
            printf ("%s\n", t );
            delete [] t;
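            // Dump each 16-bit character of the value; "j" avoids clashing
            // with the outer loop's "i".
            for ( unsigned int j = 0; j < value.length(); j++ )
            {
                printf( " AT %u %c %02x\n", j, value.charAt(j), value.charAt(j) );
            }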
        }
    }

Both the transcode'd value and the "raw" value.charAt() output show my parsed
attribute value as latin-1.

It seems to me that Xerces converts the UTF-8 encoded attribute values
during the parse.
How can I get Xerces to return the actual UTF-8 encoded data rather than
the latin-1 representation?

(I am using Xerces 1.5.2 ! - I know it's old but I'm trying to avoid a
massive upgrade exercise if at all possible)

thanks
Robert


