Re: Problems with xerces-c version 1.7.0 and UTF-8

Anna Simbirtsev Wed, 17 Sep 2008 09:52:59 -0700

When I print it in hex format, I get
�: 0xffffffd0
�: 0xffffffb1
�: 0xffffffd0
�: 0xffffffb1
�: 0xffffffd0
�: 0xffffffb1


I am not even sure what format that is, but maybe my shell does not
know what it is.
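
For reference, the 0xffffff prefix in the output above is what a sign-extended
byte looks like: where char is signed, casting a byte of 0x80 or higher
straight to int gives a negative value, which prints as 0xffffffd0 rather than
0xd0. The low bytes, 0xd0 0xb1, are the two-byte UTF-8 encoding of U+0431
(Cyrillic small letter be), so the string does appear to hold UTF-8. Below is a
minimal sketch of a dump that avoids the sign extension; hexDump is a
hypothetical helper name, not part of domtools or xerces-c.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not from domtools or xerces-c): returns every byte of
// str formatted as "0xNN " with two hex digits per byte.
std::string hexDump(const std::string& str) {
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::string::size_type i = 0; i < str.length(); ++i) {
        // Go through unsigned char first so bytes >= 0x80 are not
        // sign-extended when widened to an integer.
        const unsigned int byte = static_cast<unsigned char>(str[i]);
        out << "0x" << std::setw(2) << byte << ' ';
    }
    return out.str();
}
```

Writing cout << hexDump(str) << endl; in place of the plain cout << str would
then show one two-digit value per byte, which is easier to compare against a
UTF-8 table.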


On Wed, 2008-09-17 at 15:39 +0200, Alberto Massari wrote:
> Hi Anna,
> if I am not mistaken, the code you attached doesn't have the sample data 
> you are trying to parse (e.g. parseString is used to parse the result of 
> a toXML call on an extension object).
> However, you say "in the dom_wrapper.c I print the string before it is 
> passed to the xerces-c parser [...] and my value in utf-8 looks fine"; 
> in the code you write
> 
>    cout << "parseString: " << str << endl;
>    return parseMemory(str.c_str(),(int)str.length());
> 
> But the fact that your console prints the data as you expect doesn't 
> imply that the std::string contains real UTF-8; your shell could be 
> using a Japanese locale, and be able to print correctly 
> Shift_JIS-encoded strings (while failing to print UTF-8-encoded strings).
> If you want to really see what you are considering UTF-8, replace that 
> cout << str with this code
> 
> for(int i=0;i<str.length();i++)
>   cout << "0x" << hex << (int)str[i] << " ";
> cout << endl;
> 
> Alberto
> 
> Anna Simbirtsev wrote:
> > In the epp_eppXMLbase.cc in function createDOMDocument it calls
> > parseString function from domtools::XercesParser. In the dom_wrapper.c I
> > print the string before it is passed to the xerces-c parser in
> > domtools::XercesParser::parseMemory function and my value in utf-8 looks
> > fine. When it gets back from xerces-c a DOM_document, it uses XercesNode
> > object(defined in dom_wrapper.h) to store the DOM_document and break it
> > into nodes. Then in epp_eppXMLbase.cc, in function
> > eppobject::epp::addExtensionElements(EPP_output & outputobject, const
> > epp_extension_ref_seq_ref & extensions)
> >
> > it calls
> > DomPrint dp(outputobject);
> > dp.putDOMTree(extensionDoc);
> >
> > from dom_print.cc where I actually print the value in putDOMTree
> > function. Here the value looks truncated.
> > The entire source code of domtools is available on
> > http://sourceforge.net/project/showfiles.php?group_id=26675
> >
> > Thank you very much for your help.
> >
> > On Wed, 2008-09-17 at 08:19 +0200, Alberto Massari wrote:
> >   
> >> Anna Simbirtsev wrote:
> >>     
> >>> I pass just plain xml string to the DOMParser, so I don't use the
> >>> transcode function.
> >>>
> >>> [...]
> >>> I just copy utf-8 strings from wikipedia.org and paste it right
> >>> into the code to test. After I compiled the parser with ICU, it returns
> >>> the string, but shorter. My xml has UTF-8 encoding set: <?xml
> >>> version='1.0' encoding='UTF-8'?>.
> >>>   
> >>>       
> >> If you just used cut & paste from your browser to your C++ code editor, 
> >> I can bet you are not pasting UTF-8 codepoints, but something in your 
> >> local code page. Can you attach your source code to this e-mail 
> >> (attached, not copied)?
> >>
> >> Alberto
> >>     
>
