RE: utf-8 encoded attribute values

Robert Parker Thu, 09 Jun 2005 02:24:53 -0700

Thanks for the clue on "transcoding". With that, I found a code fragment
in the FAQ.

==== Info for absolute xerces-c beginners like me ====
Xerces stores its values in arrays of type XMLCh; these need to be
transcoded into your own preferred output encoding scheme. So if you
want UTF-8 encoding rather than the "default code page encoding", try
this:

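// Declarations the fragment needs; the buffer size is arbitrary.
XMLTransService::Codes failReason;
XMLByte utf8[4096];
unsigned int charsEaten = 0;

XMLTranscoder* utf8Transcoder =
    XMLPlatformUtils::fgTransService->makeNewTranscoderFor("UTF8",
        failReason, 16*1024);

// You may need to call this repeatedly until fully transcoded.
// One byte of the buffer is kept free for the terminating '\0'.
unsigned int utf8Len = utf8Transcoder->transcodeTo(value.rawBuffer(),
    value.length(), utf8, sizeof utf8 - 1, charsEaten,
    XMLTranscoder::UnRep_Throw);

utf8[utf8Len] = '\0';
printf("UTF8(%s)\n", utf8);
delete utf8Transcoder;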
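
The "call this repeatedly" remark can be sketched without Xerces at all. The block below is only an illustration: encodeChunk is a hypothetical stand-in that mimics the shape of a transcodeTo-style call (source, count, output buffer, buffer size, characters eaten), and toUtf8 is a hypothetical helper that keeps calling it until the whole string is consumed. Neither name comes from Xerces, and the 8-byte buffer is deliberately tiny so that several passes are needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a transcodeTo-style call (not the Xerces API).
// Encodes as many whole characters as fit into `out`, returns the number of
// bytes written, and reports the UTF-16 code units consumed via `charsEaten`.
// Unpaired surrogates are not validated in this sketch.
std::size_t encodeChunk(const char16_t* src, std::size_t srcCount,
                        unsigned char* out, std::size_t maxBytes,
                        std::size_t& charsEaten)
{
    std::size_t written = 0;
    charsEaten = 0;
    while (charsEaten < srcCount) {
        std::uint32_t cp = src[charsEaten];
        std::size_t units = 1;
        // Combine a surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && charsEaten + 1 < srcCount) {
            const std::uint32_t low = src[charsEaten + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                units = 2;
            }
        }
        const std::size_t need =
            cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (written + need > maxBytes) break;  // buffer full: caller calls again
        if (need == 1) {
            out[written++] = static_cast<unsigned char>(cp);
        } else {
            // Lead byte carries the length marker plus the top bits.
            static const unsigned char lead[] = {0, 0, 0xC0, 0xE0, 0xF0};
            out[written++] = static_cast<unsigned char>(
                lead[need] | (cp >> (6 * (need - 1))));
            // Continuation bytes carry six bits each, high bits first.
            for (std::size_t k = need - 1; k-- > 0;)
                out[written++] = static_cast<unsigned char>(
                    0x80 | ((cp >> (6 * k)) & 0x3F));
        }
        charsEaten += units;
    }
    return written;
}

// Calls encodeChunk repeatedly until every code unit has been consumed.
std::string toUtf8(const std::u16string& text)
{
    std::string result;
    unsigned char buf[8];  // small on purpose, to force several passes
    std::size_t offset = 0;
    while (offset < text.size()) {
        std::size_t eaten = 0;
        const std::size_t n = encodeChunk(text.data() + offset,
                                          text.size() - offset,
                                          buf, sizeof buf, eaten);
        assert(eaten > 0 && "buffer must hold at least one encoded character");
        result.append(reinterpret_cast<const char*>(buf), n);
        offset += eaten;
    }
    return result;
}
```

The point of the loop is that the "characters eaten" count, not the byte count, decides where the next call starts; with a real transcoder the same pattern should apply, advancing the source pointer by charsEaten after each call.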

=== what's the default code page on HP-UX? =====
$ locale -ck LC_CTYPE
LC_CTYPE
direction="0"
context="0"
code_set_name="roman8"
alt_punct=""
upper=
lower=
alpha=
digit=
space=
cntrl=
punct=
graph=
print=
xdigit=
blank=
toupper=
tolower=

(my guess is that roman8 is the same as latin-1 ...?)
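
The same information can probably be read from inside a program. This is a small sketch assuming the platform provides the POSIX nl_langinfo() call with the CODESET item; currentCodeset is a made-up helper name, and the string returned depends entirely on the system and environment.

```cpp
#include <clocale>
#include <langinfo.h>
#include <string>

// Returns the character set name of the current LC_CTYPE locale,
// or an empty string if the system reports none.
// Assumes a POSIX system that provides nl_langinfo(CODESET).
std::string currentCodeset()
{
    std::setlocale(LC_CTYPE, "");  // adopt the locale from the environment
    const char* name = nl_langinfo(CODESET);
    return name ? std::string(name) : std::string();
}
```

Printing currentCodeset() at start-up shows which encoding a locale-dependent transcode() would be targeting on that machine.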

-----Original Message-----
From: Alberto Massari [mailto:[EMAIL PROTECTED] 
Sent: 07 June 2005 17:31
To: [email protected]
Subject: Re: utf-8 encoded attribute values

Hi Robert,
once the file has been parsed, all you see is 16-bit Unicode values
(UTF-16); if you deal only with English text, these will look the same as
latin-1.
Usually you shouldn't care about seeing the original UTF-8 sequence, as you
should be interested in the actual character being represented; but if you
need it for a valid reason, you should instantiate the UTF8Transcoder and
tell it to transcode from Unicode to UTF-8.

Alberto
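
To illustrate why parsed values can look like latin-1: Unicode code points below 0x100 have the same numeric values as the corresponding Latin-1 bytes, so a UTF-16 string made only of such characters narrows to Latin-1 one byte per character. The sketch below is self-contained and does not use Xerces; fitsLatin1 and toLatin1 are hypothetical helper names.

```cpp
#include <string>

// True when every UTF-16 code unit is below 0x100, i.e. the text can be
// written in Latin-1 with numerically identical values.
bool fitsLatin1(const std::u16string& text)
{
    for (char16_t unit : text)
        if (unit > 0xFF) return false;
    return true;
}

// Narrows UTF-16 to Latin-1 bytes; anything outside Latin-1 becomes
// `substitute`. Surrogate halves are replaced one by one in this sketch.
std::string toLatin1(const std::u16string& text, char substitute = '?')
{
    std::string out;
    out.reserve(text.size());
    for (char16_t unit : text)
        out.push_back(unit <= 0xFF
            ? static_cast<char>(static_cast<unsigned char>(unit))
            : substitute);
    return out;
}
```

For example, U+00E9 is stored as the single 16-bit value 0x00E9, which narrows to the Latin-1 byte 0xE9, whereas the UTF-8 form of the same character is the two bytes 0xC3 0xA9.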

At 12.53 07/06/2005 +0100, Robert Parker wrote:
>Hi
>
>I am parsing an XML string that is encoded in UTF-8 and I am using the 
>following code to view element attributes:
>
>     DOM_NamedNodeMap NodeMap    = node.getAttributes();
>     if ( NodeMap != NULL) {
>
>         unsigned int len = NodeMap.getLength();
>         for ( int i = 0; i < len; ++i) {
>             DOM_Node attr = NodeMap.item(i);
>
>             DOMString tag = attr.getNodeName();
>             char *t = tag.transcode();
>             printf ("    %s=", t );
>             delete [] t;
>
>             DOMString value     = attr.getNodeValue();
>             t = value.transcode();
>             printf ("%s\n", t );
>             delete [] t;
>
>             for ( int j = 0; j < value.length() ; j++ )
>             {
>                 printf( " AT %d %c %02x\n", j, value.charAt(j), value.charAt(j) );
>             }
>         }
>     }
>
>Both the transcode'd value and the "raw" value.charAt() show my parsed
>attribute value as latin-1.
>
>It seems to me that Xerces converts the UTF-8 encoded attribute values 
>during the parse. How can I get Xerces to return the actual UTF-8 
>encoded data rather than the latin-1 representation?
>
>(I am using Xerces 1.5.2 ! - I know it's old but I'm trying to avoid a 
>massive upgrade exercise if at all possible)
>
>thanks
>Robert
>
>
>
>______________________________________________________________________
>This email has been scanned by the MessageLabs Email Security System. 
>For more information please visit http://www.messagelabs.com/email 
>______________________________________________________________________

