Stephen Collyer wrote:
David Bertoni wrote:
Stephen Collyer wrote:
I have a SAX2 parser which is exhibiting odd behaviour.
If I give it some XML with an XML declaration like:
<?xml version="1.0" encoding="UTF-8" ?>
it fails with an "Invalid document structure" error.
If I remove the encoding declaration, then it parses correctly.
This is quite strange, since the parser will assume the encoding is
UTF-8 without an encoding declaration. The only case where I could
imagine this might happen is with a UTF-16 document with an encoding
declaration that indicates a byte-oriented encoding. You can verify
this by looking at a binary dump of the XML stream.
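One way to do that check in code rather than by eye is sketched below. It classifies the first few bytes of the stream using the byte patterns from Appendix F of the XML 1.0 recommendation. The function name sniffXmlEncoding is hypothetical and not part of Xerces, and UTF-32 and EBCDIC are deliberately left out to keep it short.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Xerces API): guess the encoding family of an
// XML stream from its first bytes. UTF-32 and EBCDIC are not handled.
std::string sniffXmlEncoding(const unsigned char* p, std::size_t n)
{
    // Byte order marks come first if present.
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return "UTF-8 (BOM)";
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return "UTF-16BE (BOM)";
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return "UTF-16LE (BOM)";

    // No BOM: look at how "<?" is laid out in memory.
    if (n >= 4 && p[0] == 0x00 && p[1] == '<' && p[2] == 0x00 && p[3] == '?')
        return "UTF-16BE (no BOM)";
    if (n >= 4 && p[0] == '<' && p[1] == 0x00 && p[2] == '?' && p[3] == 0x00)
        return "UTF-16LE (no BOM)";
    if (n >= 4 && p[0] == '<' && p[1] == '?' && p[2] == 'x' && p[3] == 'm')
        return "byte-oriented";  // UTF-8, ASCII, ISO-8859-x, ...

    return "unknown";
}
```

If this reports one of the UTF-16 results for a buffer whose declaration says encoding="UTF-8", the declaration and the actual bytes disagree.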
Dave, thanks for that - I suspect I know what the problem is.
I am, in fact, handing Xerces a UTF-16 document with an encoding
declaration that says UTF-8 - is that what you mean by a "byte-oriented
encoding", i.e. a variable-length encoding?
Yes. UTF-8 is byte-oriented. OK, well it's octet-oriented, but either
way, they are bytes in C++. UTF-16 is also a variable-length encoding,
but it uses 16-bit code units.
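To make the distinction concrete, here is a small sketch that counts how many code units each encoding needs for a single code point. The function names are made up for illustration, and the input is assumed to be a valid Unicode scalar value (at most 0x10FFFF and not a surrogate).

```cpp
#include <cstddef>
#include <cstdint>

// Number of 8-bit code units UTF-8 uses for one code point.
// Assumes cp is a valid Unicode scalar value.
std::size_t utf8Units(std::uint32_t cp)
{
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;  // rest of the Basic Multilingual Plane
    return 4;                    // supplementary planes
}

// Number of 16-bit code units UTF-16 uses for one code point.
// Code points above the BMP need a surrogate pair.
std::size_t utf16Units(std::uint32_t cp)
{
    return cp < 0x10000 ? 1 : 2;
}
```

Both encodings are variable-length; they differ in the size of the unit, which is why a parser told to read UTF-8 cannot make sense of a buffer of 16-bit units.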
The reason for this is that I am receiving a document in UTF-8 with
a declaration that indicates UTF-8, but I'm transcoding it to UTF-16
early on to make it fit in a Qt QString (I'm using the Trolltech Qt
libs). However, of course, if I hand that off to Xerces, the encoding
declaration no longer matches the true encoding, which I guess is the
cause of the problem. This only dawned on me after I'd read your comment.
Unfortunately, it's not a very good error message. If you want to, you
can create a Jira issue so we can possibly fix it one of these days.
The only way I can see to fix this is to edit the declaration in code.
Or can I tell Xerces to ignore it somehow? Advice appreciated.
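For reference, editing the declaration in code could look roughly like the sketch below. It works on a std::u16string, replaces only the value of the encoding pseudo-attribute inside the XML declaration, and leaves everything else alone. The function name retargetEncoding is hypothetical, and the sketch assumes the string starts directly with "<?xml" (no leading byte order mark).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replace the encoding name in the XML declaration of
// a UTF-16 string. Returns the input unchanged if there is no declaration
// or the declaration has no encoding pseudo-attribute.
std::u16string retargetEncoding(std::u16string doc, const std::u16string& newName)
{
    const std::size_t npos = std::u16string::npos;

    // The declaration, if present, must be the very first thing in the document.
    if (doc.compare(0, 5, u"<?xml") != 0) return doc;
    const std::size_t declEnd = doc.find(u"?>");
    if (declEnd == npos) return doc;

    // Only look for "encoding" inside the declaration itself.
    const std::size_t key = doc.find(u"encoding", 5);
    if (key == npos || key > declEnd) return doc;

    // The value may be delimited by single or double quotes.
    const std::size_t open = doc.find_first_of(u"\"'", key);
    if (open == npos || open > declEnd) return doc;
    const std::size_t close = doc.find(doc[open], open + 1);
    if (close == npos || close > declEnd) return doc;

    doc.replace(open + 1, close - open - 1, newName);
    return doc;
}
```

Calling retargetEncoding(text, u"UTF-16") on the transcoded text would make the declaration agree with the 16-bit data actually being handed to the parser.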
The easiest way is to set the encoding on the InputSource to "UTF-16",
which will force the parser to use that encoding.
Dave