When I print it in hex format, I get:

�: 0xffffffd0 �: 0xffffffb1 �: 0xffffffd0 �: 0xffffffb1 �: 0xffffffd0 �: 0xffffffb1

I am not even sure what format that is; maybe my shell just does not know how to display it.

On Wed, 2008-09-17 at 15:39 +0200, Alberto Massari wrote:
> Hi Anna,
> if I am not mistaken, the code you attached doesn't have the sample data
> you are trying to parse (e.g. parseString is used to parse the result of
> a toXML call on an extension object).
> However, you say "in the dom_wrapper.c I print the string before it is
> passed to the xerces-c parser [...] and my value in utf-8 looks fine";
> in the code you write
>
>     cout << "parseString: " << str << endl;
>     return parseMemory(str.c_str(),(int)str.length());
>
> But the fact that your console prints the data as you expect doesn't
> imply that the std::string contains real UTF-8; your shell could be
> using a Japanese locale, and be able to print correctly
> Shift_JIS-encoded strings (while failing to print UTF-8-encoded strings).
> If you want to really see what you are considering UTF-8, replace that
> cout << str with this code
>
>     for(int i=0;i<str.length();i++)
>         cout << "0x" << hex << (int)str[i] << " ";
>     cout << endl;
>
> Alberto
>
> Anna Simbirtsev wrote:
> > In the epp_eppXMLbase.cc in function createDOMDocument it calls
> > parseString function from domtools::XercesParser. In the dom_wrapper.c I
> > print the string before it is passed to the xerces-c parser in
> > domtools::XercesParser::parseMemory function and my value in utf-8 looks
> > fine. When it gets back from xerces-c a DOM_document, it uses XercesNode
> > object(defined in dom_wrapper.h) to store the DOM_document and break it
> > into nodes. Then in epp_eppXMLbase.cc, in function
> > eppobject::epp::addExtensionElements(EPP_output & outputobject, const
> > epp_extension_ref_seq_ref & extensions)
> >
> > it calls
> >     DomPrint dp(outputobject);
> >     dp.putDOMTree(extensionDoc);
> >
> > from dom_print.cc where I actually print the value in putDOMTree
> > function. Here the value looks truncated.
> > The entire source code of domtools is available on
> > http://sourceforge.net/project/showfiles.php?group_id=26675
> >
> > Thank you very much for your help.
> >
> > On Wed, 2008-09-17 at 08:19 +0200, Alberto Massari wrote:
> >
> >> Anna Simbirtsev wrote:
> >>
> >>> I pass just plain xml string to the DOMParser, so I don't use the
> >>> transcode function.
> >>>
> >>> [...]
> >>> I just copy utf-8 strings from wikipedia.org and paste it right
> >>> into the code to test. After I compiled the parser with ICU, it returns
> >>> the string, but shorter. My xml has UTF-8 encoding set: <?xml
> >>> version='1.0' encoding='UTF-8'?>.
> >>>
> >>>
> >> If you just used cut & paste from your browser to your C++ code editor,
> >> I can bet you are not pasting UTF-8 codepoints, but something in your
> >> local code page. Can you attach your source code to this e-mail
> >> (attached, not copied)?
> >>
> >> Alberto
> >>
>
