I pass the plain XML string directly to the DOMParser, so I don't use the
transcode function:
const void * const buffer = str.c_str();
::DOMParser parser;
parser.setDoNamespaces(true);
parser.setToCreateXMLDeclTypeNode(false);
// The second argument is the buffer size in bytes, not in characters.
MemBufInputSource* memBufIS = new MemBufInputSource
(
    (const XMLByte*)buffer
    , length
    , "domtools"
    , false
);
try {
    parser.parse(*memBufIS);
    DOM_Document doc = parser.getDocument();
    delete memBufIS;
    if (!doc.isNull()) return new XercesNode(doc);
} catch(...) {
    delete memBufIS;
}
return new XercesNode();
Without ICU, the parser returned an empty string instead of the UTF-8
string. To test, I copy UTF-8 strings from wikipedia.org and paste them
straight into the code. After I compiled the parser with ICU, it returns
the string, but shorter than the input. My XML declares UTF-8 encoding:
<?xml version='1.0' encoding='UTF-8'?>.
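One thing that may be worth ruling out (this is an assumption, since the
snippet does not show where length comes from): MemBufInputSource expects
the buffer size in bytes, and in UTF-8 every non-ASCII character occupies
more than one byte. If length held a character count, the parser would
receive a buffer that is a few bytes short. The sketch below is a
standalone illustration that does not use Xerces; utf8CodePoints is a
hypothetical helper written only for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 byte string
// by skipping continuation bytes (bit pattern 10xxxxxx).
std::size_t utf8CodePoints(const std::string& s)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII byte starts a new code point
        }
    }
    return count;
}

// Byte count of the same string; this is the kind of value a byte-oriented
// input source needs. std::string::size() already reports bytes.
std::size_t utf8Bytes(const std::string& s)
{
    return s.size();
}
```

For the string "<a>\xC3\xA9</a>" (an e-acute written as its two UTF-8
bytes), utf8Bytes gives 9 while utf8CodePoints gives 8, so a length taken
from a character count would drop the final byte of the document.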
On Tue, 2008-09-16 at 12:47 -0700, David Bertoni wrote:
> Anna Simbirtsev wrote:
> > Hello,
> >
> > I compiled xerces-c 1.7.0 with ICU 4.0 to be able to handle UTF-8
> > strings. Now the parser takes in a UTF-8 string, but when it comes out it's
> > truncated by a couple of characters. Can anybody help?
> Note that Xerces-C can parse documents encoded in UTF-8 _without_
> integrating the ICU.
>
> Perhaps you are calling XMLString::transcode() or
> DOMString::transcode()? If so, please search the archives of the
> mailing list, as this problem comes up often (in fact, just last week).
>
> If not, then please provide more information about what you mean by
> "when it comes out" and what characters are truncated.
>
> Dave