Thanks for the clue on "transcoding". With that, I found a code fragment
in the FAQ.
==== Info for absolute xerces-c beginners like me ====
Xerces stores its values in arrays of type XMLCh. These need to be
transcoded into your own preferred output encoding scheme. So if you
want UTF-8 encoding rather than the "default code page encoding", try
this:
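    XMLTransService::Codes failReason;
    XMLTranscoder* utf8Transcoder =
        XMLPlatformUtils::fgTransService->makeNewTranscoderFor("UTF8",
            failReason, 16*1024);
    XMLByte utf8[1024];    // buffer size is an arbitrary choice
    unsigned int charsEaten = 0;
    // you may need to call this repeatedly until fully transcoded
    // (charsEaten says how many XMLCh values were consumed);
    // sizeof utf8 - 1 leaves room for the terminating '\0'
    unsigned int utf8Len = utf8Transcoder->transcodeTo(value.rawBuffer(),
        value.length(), utf8, sizeof utf8 - 1, charsEaten,
        XMLTranscoder::UnRep_Throw );
    utf8[utf8Len] = '\0';
    printf("UTF8(%s)\n", (const char*) utf8 );
    delete utf8Transcoder;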
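For anyone wondering what "call this repeatedly" looks like in practice, here is a minimal sketch of the pattern. It does not use Xerces at all: encodeChunk() is a hypothetical stand-in that I wrote to mimic the shape of a transcodeTo-style call (fill a fixed buffer, report how many input units were consumed), and Utf16Unit is an assumed 16-bit stand-in for XMLCh. Treat it as an illustration of the loop, not as the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

typedef unsigned short Utf16Unit;  // assumed 16-bit stand-in for XMLCh

// Hypothetical stand-in for a transcodeTo-style call: encodes as many whole
// characters as fit into out[0..maxBytes) and reports the number of source
// units consumed in unitsEaten. Unpaired surrogates are replaced by U+FFFD.
static std::size_t encodeChunk(const Utf16Unit* src, std::size_t srcCount,
                               unsigned char* out, std::size_t maxBytes,
                               std::size_t& unitsEaten)
{
    std::size_t written = 0;
    unitsEaten = 0;
    while (unitsEaten < srcCount) {
        unsigned long cp = src[unitsEaten];
        std::size_t used = 1;
        if (cp >= 0xD800 && cp <= 0xDBFF) {          // high surrogate
            if (unitsEaten + 1 < srcCount &&
                src[unitsEaten + 1] >= 0xDC00 && src[unitsEaten + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10)
                             + (src[unitsEaten + 1] - 0xDC00);
                used = 2;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {   // stray low surrogate
            cp = 0xFFFD;
        }
        const std::size_t need =
            cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (written + need > maxBytes)
            break;                                   // full: caller calls again
        if (need == 1) {
            out[written++] = (unsigned char) cp;
        } else if (need == 2) {
            out[written++] = (unsigned char) (0xC0 | (cp >> 6));
            out[written++] = (unsigned char) (0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[written++] = (unsigned char) (0xE0 | (cp >> 12));
            out[written++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[written++] = (unsigned char) (0x80 | (cp & 0x3F));
        } else {
            out[written++] = (unsigned char) (0xF0 | (cp >> 18));
            out[written++] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
            out[written++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[written++] = (unsigned char) (0x80 | (cp & 0x3F));
        }
        unitsEaten += used;
    }
    return written;
}

// The "call repeatedly" loop: drain the input through a small fixed buffer,
// advancing by however many units each call reports as consumed.
static std::string toUtf8(const Utf16Unit* src, std::size_t srcCount)
{
    std::string result;
    unsigned char buf[8];          // deliberately tiny to force several calls
    std::size_t offset = 0;
    while (offset < srcCount) {
        std::size_t eaten = 0;
        const std::size_t n = encodeChunk(src + offset, srcCount - offset,
                                          buf, sizeof buf, eaten);
        assert(eaten > 0);         // buf holds at least one 4-byte character
        result.append(reinterpret_cast<const char*>(buf), n);
        offset += eaten;
    }
    return result;
}
```

With the real transcoder the loop has the same shape: advance the source pointer by charsEaten after each call and append the bytes returned, until the whole value has been consumed.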
==== What's the default code page on HP-UX? ====
$ locale -ck LC_CTYPE
LC_CTYPE
direction="0"
context="0"
code_set_name="roman8"
alt_punct=""
upper=
lower=
alpha=
digit=
space=
cntrl=
punct=
graph=
print=
xdigit=
blank=
toupper=
tolower=
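(my first guess was that roman8 is the same as latin-1, but it is not:
HP Roman-8 is a separate 8-bit code set that shares the 7-bit ASCII
range with latin-1 and places most accented characters at different
byte values)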
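As a footnote to the reply quoted below, which says the parsed values are UTF-16 and "look the same as latin-1": latin-1 assigns byte value N to code point U+00NN, so any 16-bit value below 0x100 is numerically identical to its latin-1 byte, while its UTF-8 form needs two bytes from 0x80 upwards. The small sketch below prints both forms for one 16-bit value. describe() and Utf16Unit are names I made up for the illustration; nothing here comes from Xerces.

```cpp
#include <cstdio>
#include <string>

typedef unsigned short Utf16Unit;  // assumed 16-bit stand-in for XMLCh

// Returns a one-line summary of how a single 16-bit value is written in
// latin-1 and in UTF-8. Only values below 0x100 exist in latin-1.
static std::string describe(Utf16Unit u)
{
    const unsigned cp = u;
    char line[64];
    if (cp < 0x80) {
        // ASCII: one identical byte in both encodings
        std::snprintf(line, sizeof line,
                      "U+%04X latin-1=%02X utf-8=%02X", cp, cp, cp);
    } else if (cp < 0x100) {
        // latin-1 keeps the value as one byte; UTF-8 splits it into two
        std::snprintf(line, sizeof line,
                      "U+%04X latin-1=%02X utf-8=%02X %02X",
                      cp, cp, 0xC0u | (cp >> 6), 0x80u | (cp & 0x3Fu));
    } else {
        std::snprintf(line, sizeof line, "U+%04X latin-1=n/a", cp);
    }
    return line;
}
```

For example, describe(0x00E9) (e with acute accent) gives "U+00E9 latin-1=E9 utf-8=C3 A9", which is why the charAt() dump shows e9 even though the input document contained the two bytes c3 a9.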
-----Original Message-----
From: Alberto Massari [mailto:[EMAIL PROTECTED]
Sent: 07 June 2005 17:31
To: [email protected]
Subject: Re: utf-8 encoded attribute values
Hi Robert,
once the file has been parsed, all you see is 16-bit Unicode values
(UTF-16), which, if you deal only with English text, will look the same
as latin-1.
Usually you shouldn't care about seeing the original UTF-8 sequence, as
you should be interested in the actual character being represented; but
if you need it for a valid reason, you should instantiate the
UTF8Transcoder and tell it to transcode from Unicode to UTF-8.
Alberto
At 12.53 07/06/2005 +0100, Robert Parker wrote:
>Hi
>
>I am parsing an XML string that is encoded in UTF-8 and I am using the
>following code to view element attributes:
>
>    DOM_NamedNodeMap NodeMap = node.getAttributes();
>    if ( NodeMap != NULL) {
>
>        unsigned int len = NodeMap.getLength();
>        for ( unsigned int i = 0; i < len; ++i) {
>            DOM_Node attr = NodeMap.item(i);
>
>            DOMString tag = attr.getNodeName();
>            char *t = tag.transcode();
>            printf (" %s=", t );
>            delete [] t;
>
>            DOMString value = attr.getNodeValue();
>            t = value.transcode();
>            printf ("%s\n", t );
>            delete [] t;
>
>            // dump each raw 16-bit value of the attribute
>            for ( unsigned int j = 0; j < value.length() ; j++ ) {
>                printf( " AT %u %c %02x\n", j, value.charAt(j), value.charAt(j) );
>            }
>        }
>    }
>
>Both the transcode'd value and the "raw" value.charAt() output show my
>parsed attribute value as latin-1.
>
>It seems to me that Xerces converts the UTF-8 encoded attribute values
>during the parse. How can I get Xerces to return the actual UTF-8
>encoded data rather than the latin-1 representation?
>
>(I am using Xerces 1.5.2 ! - I know it's old but I'm trying to avoid a
>massive upgrade exercise if at all possible)
>
>thanks
>Robert
>
>
>
>______________________________________________________________________
>This email has been scanned by the MessageLabs Email Security System.
>For more information please visit http://www.messagelabs.com/email
>______________________________________________________________________