Re: Problems with xerces-c version 1.7.0 and UTF-8

Anna Simbirtsev Tue, 16 Sep 2008 12:56:04 -0700

I pass a plain XML string directly to the DOMParser, so I don't use the
transcode function.


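   // Raw bytes of the XML document. length must be the size of this
   // buffer in bytes, not the number of characters.
   const XMLByte* const buffer = (const XMLByte*)str.c_str();

   ::DOMParser parser;
   parser.setDoNamespaces(true);
   parser.setToCreateXMLDeclTypeNode(false);

   // Stack allocation releases the input source on every exit path.
   // The last argument (false) means the source does not adopt the buffer.
   MemBufInputSource memBufIS(buffer, length, "domtools", false);

   try {
      parser.parse(memBufIS);
      DOM_Document doc = parser.getDocument();
      if (!doc.isNull()) return new XercesNode(doc);
   } catch(...) {
      // Any parse failure falls through to the empty node below.
   }
   return new XercesNode();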

Without ICU, it returned an empty string instead of the UTF-8 string.
To test, I copy UTF-8 strings from wikipedia.org and paste them directly
into the code. After I compiled the parser with ICU, it returns the
string, but shorter. My XML declares UTF-8 encoding:
<?xml version='1.0' encoding='UTF-8'?>.
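
One thing that may be worth checking here (an assumption, since the code
above does not show how length is computed): MemBufInputSource takes its
size as a byte count, and for non-ASCII UTF-8 text the byte count is
larger than the character count. The sketch below uses no Xerces calls.
The helper names utf8ByteCount, utf8CodePointCount and sampleCheck are
hypothetical and only illustrate the difference.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Size of the buffer in bytes: what a byte-oriented API expects.
std::size_t utf8ByteCount(const std::string& s) {
    return s.size();
}

// Number of Unicode code points: every byte that is not a UTF-8
// continuation byte (bit pattern 10xxxxxx) starts a new code point.
std::size_t utf8CodePointCount(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Sample: the Russian word "Привет" written out as explicit UTF-8 bytes,
// so the result does not depend on the source file encoding.
void sampleCheck() {
    const std::string str =
        "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";
    assert(utf8ByteCount(str) == 12);      // two bytes per letter
    assert(utf8CodePointCount(str) == 6);  // six letters
}
```

If length were a character count, a byte-oriented reader would stop
before the end of the buffer, which could look like truncation. Whether
that is what happens in my case is not established.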

On Tue, 2008-09-16 at 12:47 -0700, David Bertoni wrote:
> Anna Simbirtsev wrote:
> > Hello,
> > 
> > I compiled xerces-c 1.7.0 with ICU 4.0 to be able to handle UTF-8
> > strings. Now the parser takes in a UTF-8 string, but when it comes out
> > it's truncated by a couple of characters. Can anybody help?
> Note that Xerces-C can parse documents encoded in UTF-8 _without_ 
> integrating the ICU.
> 
> Perhaps you are calling XMLString::transcode() or 
> DOMString::transcode()?  If so, please search the archives of the 
> mailing list, as this problem comes up often (in fact, just last week).
> 
> If not, then please provide more information about what you mean by 
> "when it comes out" and what characters are truncated.
> 
> Dave
