RE: R: R: R: using non standard character with zerces

Jesse Pelton Mon, 19 Sep 2005 06:24:36 -0700

Sure, you can store 0xA5 in a DOM string, but you have to represent it properly 
in the string that you store.  This means you have to store the character value 
0xA5 itself; you cannot represent it in the string as a numeric character 
reference like "&#xA5;":


   XMLCh* pszA5Good = X("\xA5");   // Yen
   XMLCh* pszA5Bad  = X("&#xA5;"); // gobbledygook

Both strings are perfectly legitimate, but if you put the latter into the DOM, 
the serializer MUST escape the ampersand so that the string you are adding to 
the DOM can be faithfully recovered.  In other words, if you say the string is 
"&#xA5;", the serializer must escape it so that when it's parsed, the string's 
value remains "&#xA5;", because that's the string you specified.
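
To make the round trip concrete, here is a minimal sketch in standard C++.  It 
is not the Xerces serializer or parser: escapeText, unescapeText and appendUtf8 
are hypothetical helpers invented for illustration, they work on UTF-8 
std::string rather than XMLCh, and they handle only the predefined entities 
and hexadecimal character references up to U+FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: escape character data the way a serializer must,
// so that markup characters in the stored text survive a round trip.
std::string escapeText(const std::string& text)
{
    std::string out;
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:  out += c;       break;
        }
    }
    return out;
}

// Hypothetical helper: append code point cp to out as UTF-8.
// Sketch only: code points above U+FFFF are not handled.
void appendUtf8(std::string& out, unsigned long cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: expand references the way a parser does when it
// reads character data. std::stoul throws if the hex digits are malformed.
std::string unescapeText(const std::string& xml)
{
    std::string out;
    std::size_t i = 0;
    while (i < xml.size()) {
        const std::size_t semi =
            (xml[i] == '&') ? xml.find(';', i) : std::string::npos;
        if (semi == std::string::npos) {      // ordinary character
            out += xml[i++];
            continue;
        }
        const std::string name = xml.substr(i + 1, semi - i - 1);
        if      (name == "amp")  out += '&';
        else if (name == "lt")   out += '<';
        else if (name == "gt")   out += '>';
        else if (name == "apos") out += '\'';
        else if (name == "quot") out += '"';
        else if (name.size() > 2 && name[0] == '#' && name[1] == 'x')
            appendUtf8(out, std::stoul(name.substr(2), nullptr, 16));
        else
            out += xml.substr(i, semi - i + 1); // unknown reference: keep it
        i = semi + 1;
    }
    return out;
}
```

With these helpers, unescapeText(escapeText(s)) gives back s for any stored 
string s.  Storing the six characters "&#xA5;" therefore yields "&amp;#xA5;" 
on output, while a hand-written "&#xA5;" in a document is what parses to the 
Yen character.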

If you put the former into the DOM, the serializer will likewise do what it 
must to ensure that the specified string comes back when Xerces or some other 
conforming XML processor parses the document.  Depending on the document 
encoding, it may or may not be serialized as "&#xA5;".  Any conforming processor 
that recognizes the document encoding will parse the serialized value correctly.
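
As a rough illustration of how the output can depend on the encoding, here is 
a small sketch in standard C++.  It is an assumption-laden toy, not how Xerces 
is implemented: the Encoding enum and serializeText are hypothetical names, 
only three encodings are modeled, and only characters in the Basic Multilingual 
Plane are handled.

```cpp
#include <cstdio>
#include <string>

// Hypothetical set of output encodings for this sketch.
enum class Encoding { UsAscii, Latin1, Utf8 };

// Serialize DOM-style UTF-16 text into bytes of the target encoding.
// Characters the encoding cannot represent become character references.
std::string serializeText(const std::u16string& text, Encoding enc)
{
    std::string out;
    for (char16_t ch : text) {
        const unsigned cp = ch;   // sketch: surrogate pairs are not handled
        if (ch == u'&') {
            out += "&amp;";
        } else if (ch == u'<') {
            out += "&lt;";
        } else if (ch == u'>') {
            out += "&gt;";
        } else if (cp < 0x80) {
            out += static_cast<char>(cp);            // plain ASCII
        } else if (enc == Encoding::Latin1 && cp <= 0xFF) {
            out += static_cast<char>(cp);            // one Latin-1 byte
        } else if (enc == Encoding::Utf8) {
            if (cp < 0x800) {                        // two-byte sequence
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {                                 // three-byte sequence
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        } else {
            char ref[16];                            // not representable
            std::snprintf(ref, sizeof ref, "&#x%X;", cp);
            out += ref;
        }
    }
    return out;
}
```

Here the Yen character U+00A5 comes out as the reference "&#xA5;" for 
US-ASCII, as the single byte 0xA5 for Latin-1, and as the bytes 0xC2 0xA5 for 
UTF-8.  In every case a parser that knows the encoding recovers the same 
character.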

The bottom line is, don't pre-escape anything that you put into the DOM.  If 
you do, the serializer must escape it again, and you won't get your desired 
results.  Rather than:

  stmp = " start &apos; &lt; &gt;  &amp; &#x28; &#xA4; &#xA5; &#x29; end";
  dtxt = pDoc->createTextNode( X( stmp.c_str()));

Do:

  dtxt = pDoc->createTextNode( X(" start ' < >  & \x28 \xA4 \xA5 \x29 end"));

Or equivalently:

  dtxt = pDoc->createTextNode( X(" start ' < >  & ( \xA4 \xA5 ) end"));

Note that none of this is specific to Xerces.  Any XML processor that conforms 
to the specifications (available at www.w3.org) must behave this way.

> -----Original Message-----
> From: AESYS S.p.A. [Enzo Arlati] [mailto:[EMAIL PROTECTED] 
> Sent: Monday, September 19, 2005 8:48 AM
> To: [email protected]
> Subject: R: R: R: R: using non standard character with zerces
> 
> Do you mean that it is not possible to store a value like 0xA5 using DOM?
> I tried loading a file with extra characters using XercesDOMParser, then
> I got the DOM from the parser and printed it; it has the extra characters.
> 
> 
> output:
> DOCUMENT: <?xml version="1.0" encoding="UTF-16" standalone="no"
> ?><Messaggio>
>     <Test1> start ' &lt; &gt;  &amp; ( ¤ ¥ )  end  </Test1>
> </Messaggio>
> 
> Press Enter to continue!
> 
> 
> 
> 
> *****************************************
> * input file
> *****************************************
> <?xml version="1.0"  encoding="UTF-16" standalone="no" ?>
> <Messaggio>
>     <Test1> start &apos; &lt; &gt;  &amp; &#x28; &#xA4; &#xA5; &#x29;  end
> </Test1>
> </Messaggio>
> 
> 
> 
> *****************************************
> * reading the file with XercesDOMParser *
> *****************************************
> 
>     XercesDOMParser * domParser;
> 
>     // -------------------------------------------------------
>     domParser = new XercesDOMParser;
>     domParser->setValidationScheme( XercesDOMParser::Val_Auto );
>     domParser->setDoNamespaces( false );
>     domParser->setDoSchema( false );
>     domParser->setValidationSchemaFullChecking( false );
>     domParser->setCreateEntityReferenceNodes( false );
> 
>     DOMTreeErrorReporter * errReporter = new DOMTreeErrorReporter();
>     domParser->setErrorHandler( (ErrorHandler*)  errReporter );
> 
>     string sfile( "/test/test1.xml" );
>     domParser->parse( sfile.c_str() );
>     delete errReporter;
>     int nerr = domParser->getErrorCount();
>     if( nerr > 0 )
>     {
>        MYLOG( ae_util::format_string( "PARSE FAILED file=[%s] num.err=%d ",
>                sfile.c_str(), nerr ));
>        return IRET_ERROR;
>     }
> 
>     DOMDocument * pDoc = domParser->getDocument();
>     stmp = ManagCmd::GetStringFromDOMDocument( pDoc );
>     cout << "DOCUMENT: " + stmp << endl;
>     delete domParser;
> 
> §§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§
> 
> Why does this code behave differently?
> 
>     DOMText * dtxt;
>     DOMImplementation * impl =
>         DOMImplementationRegistry::getDOMImplementation( X("LS") );
> 
>     if( impl != NULL )
>     {
>        DOMDocument * pDoc = impl->createDocument( 0, X("Messaggio"), 0 );
>        pDoc->setEncoding( X("UTF-16") );
>        DOMElement * pRoot = pDoc->getDocumentElement();
> 
>        DOMElement * pTest = pDoc->createElement( X("TEST2") );
>        pRoot->appendChild( pTest );
> 
>        stmp = " start &apos; &lt; &gt;  &amp; &#x28; &#xA4; 
> &#xA5; &#x29;
> end";
>        dtxt                  = pDoc->createTextNode( X( 
> stmp.c_str()));
>        pTest->appendChild( dtxt );
>        stmp = ManagCmd::GetStringFromDOMDocument( pDoc );
>        cout << "DOCUMENT: " + stmp << endl;
>     }
> 
> 
> **************
> * output:
> **************
> DOCUMENT: <?xml version="1.0" encoding="UTF-16" standalone="no"
> ?><Messaggio><TEST2> start &amp;apos; &amp;lt; &amp;gt;  &amp;amp;
> &amp;#x28; &amp;#xA4; &amp;#xA5; &amp;#x29;  end</TEST2></Messaggio>
> 
> Press Enter to continue!
> 
> 
> 
> -----Original Message-----
> From: Alberto Massari [mailto:[EMAIL PROTECTED]
> Sent: Monday, September 19, 2005 10:28 AM
> To: [email protected]
> Subject: Re: R: R: R: using non standard character with zerces
> 
> 
> Hi Enzo,
> if you want to place reserved characters in the
> final XML, you should not use DOM. When you
> create a DOMText node you are asking "this is the
> text you must store, be sure that it is stored in
> a way that, when later retrieved, it's still this
> text". So, if you use reserved characters like
> "&", they get expanded into "&amp;" so that, upon
> loading, you find "&" in the corresponding
> DOMText. If you need to manually compose an XML
> you are better off using XMLFormatter and feeding
> it with literals like "<nodename>&#x23;</nodename>".
> 
> Alberto
> 
> At 10.03 19/09/2005 +0200, AESYS S.p.A. [Enzo Arlati] wrote:
> >But what I really need is a very simple way to put a sequence of
> >characters, including the & character, into the XML stream without
> >the & being parsed and translated into &amp;.
> >With Xerces, is there no way to tell the parser not to translate
> >some or all of the characters of an output string?
> >
> >-----Original Message-----
> >From: Alberto Massari [mailto:[EMAIL PROTECTED]
> >Sent: Friday, September 16, 2005 6:51 PM
> >To: [email protected]
> >Subject: Re: R: R: using non standard character with zerces
> >
> >
> >Hi Enzo,
> >
> >At 18.05 16/09/2005 +0200, AESYS S.p.A. [Enzo Arlati] wrote:
> > >But how can I include a special character inside a node?
> > >I want to use the format &#xXX; but the '&' is processed and
> > >translated into &amp;, so the character &#xA5; will be converted to
> > >&amp;#xA5; instead of the desired character entity.
> >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
