Re: SAX2 parser: encoding="UTF-8" breaks validation

David Bertoni Mon, 23 Jun 2008 10:02:25 -0700

Stephen Collyer wrote:

David Bertoni wrote:

Stephen Collyer wrote:

I have a SAX2 parser which is exhibiting odd behaviour.


If I give it some XML with an XML declaration like:

<?xml version="1.0" encoding="UTF-8" ?>

it fails with an "Invalid document structure" error.
If I remove the encoding declaration, then it parses correctly.

This is quite strange, since the parser will assume the encoding is
UTF-8 without an encoding declaration.  The only case where I could
imagine this might happen is with a UTF-16 document with an encoding
declaration that indicates a byte-oriented encoding.  You can verify
this by looking at a binary dump of the XML stream.
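
As a rough illustration of the binary-dump check, here is a minimal C++ sketch that prints the leading bytes of a buffer in hex and guesses whether it looks like UTF-16 or a byte-oriented encoding. The helper names hexDump and sniffEncodingFamily are hypothetical, not part of Xerces, and the sketch assumes the stream starts with either a byte order mark or the "<?xml" declaration. It ignores UTF-32 and EBCDIC.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: hex dump of the first maxBytes bytes, space separated.
std::string hexDump(const unsigned char* bytes, std::size_t size,
                    std::size_t maxBytes = 16) {
    std::string out;
    for (std::size_t i = 0; i < size && i < maxBytes; ++i) {
        char buf[3];  // two hex digits plus the terminating NUL
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(bytes[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// Hypothetical helper: guess the encoding family from the leading bytes.
// Assumes the stream begins with a BOM or with "<?xml".
std::string sniffEncodingFamily(const unsigned char* b, std::size_t n) {
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE";  // BOM
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE";  // BOM
    // No BOM: "<?" encoded as 16-bit code units has a zero byte per character.
    if (n >= 4 && b[0] == '<' && b[1] == 0 && b[2] == '?' && b[3] == 0)
        return "UTF-16LE";
    if (n >= 4 && b[0] == 0 && b[1] == '<' && b[2] == 0 && b[3] == '?')
        return "UTF-16BE";
    // "<?xm" as consecutive single bytes: UTF-8 or another ASCII-compatible
    // byte-oriented encoding.
    if (n >= 4 && b[0] == '<' && b[1] == '?' && b[2] == 'x' && b[3] == 'm')
        return "byte-oriented";
    return "unknown";
}
```

If the dump shows a zero byte after every ASCII character while the declaration says encoding="UTF-8", the declaration and the actual encoding disagree.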


Dave, thanks for that - I suspect I know what the problem is.
I am, in fact, handing Xerces a UTF-16 document with an encoding
that says UTF-8 - is that what you mean by a "byte-oriented encoding",
i.e. a variable-length encoding?

Yes. UTF-8 is byte-oriented. OK, well it's octet-oriented, but either way, they are bytes in C++. UTF-16 is also a variable-length encoding, but it uses 16-bit code units.
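
To make the code-unit point concrete, here is a small C++ sketch. The function name countCodePointsUtf16 is hypothetical, and the sketch assumes well-formed UTF-16 input. A character outside the Basic Multilingual Plane takes two 16-bit code units (a surrogate pair), so the number of code units can exceed the number of characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in well-formed UTF-16.
// Every code unit starts a new code point unless it is a low surrogate
// (0xDC00..0xDFFF), which is the second half of a surrogate pair.
std::size_t countCodePointsUtf16(const std::u16string& text) {
    std::size_t count = 0;
    for (char16_t unit : text) {
        if (unit < 0xDC00 || unit > 0xDFFF) ++count;
    }
    return count;
}
```

For example, the string u"A\U0001F600" holds three 16-bit code units but only two code points.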


The reason for this is that I am receiving a document in UTF-8 with
a decln that indicates UTF-8, but I'm transcoding it to UTF-16 early
on to make it fit in a Qt QString (I'm using the Trolltech Qt libs).
However, of course, if I hand that off to Xerces, the encoding decln
no longer matches the true encoding, which I guess is the cause of
the problem. This only dawned on me after I'd read your comment.

Unfortunately, it's not a very good error message. If you want to, you can create a Jira issue so we can possibly fix it one of these days.


The only way I can see to fix this is to edit the decln in code.
Or can I tell Xerces to ignore it somehow ? Advice appreciated.

The easiest way is to set the encoding on the InputSource to "UTF-16", which will force the parser to use that encoding.


Dave
