This is a known problem in Xerces (http://issues.apache.org/jira/browse/XERCESC-1288): the transcoder detects the invalid byte when it reads a new chunk of data, before the valid data preceding it has been processed, so the last known position is far from the actual error location. A possible fix would be to return the data that is valid and to report the error only when the bad data is at the beginning of the chunk.

Alberto

Boris Kolpackov wrote:
Hi Igor,

Igor Ignatyuk <igor_ignati...@hotmail.com> writes:

I am parsing the following file, which is encoded as Windows-1252:

<?xml version="1.0" ?>
<test>ä</test>

The implicit XML encoding is UTF-8, therefore it is correct that I get a
parsing error, but SAXParseException::getLineNumber and
SAXParseException::getColumnNumber return wrong values:

line 1, column 23: invalid byte '<' at position 2 of a 3-byte sequence

IMHO the line number should be 2 and the column number should be 8 (the
position of the character '<') or 7 (the position of the character 'ä').

Can you submit a bug report for this problem and attach the sample
XML to it (I cannot reproduce the problem by copying and pasting
the XML fragment from the email):

http://xerces.apache.org/xerces-c/bug-report.html


Thanks,
        Boris

