[ http://issues.apache.org/jira/browse/XERCESC-1288?page=comments#action_12423637 ]

HAUTION Philippe commented on XERCESC-1288:
-------------------------------------------

I met the same annoying bug with Xerces-C++ 2.7.0. I looked into the code and saw that XMLScanner::scanProlog() calls ReaderMgr::peekNextChar(), which in fact fills an internal buffer through XMLReader::refreshCharBuffer(), which in turn tries to transcode the bytes with XMLUTF8Transcoder::transcodeFrom().

When there is an invalid character, a UTFDataFormatException is thrown from this method, without any line or column information. This exception is caught in IGXMLScanner::scanDocument() and converted into a SAXParseException with the current line and column number, i.e. the first line, which was correctly scanned, rather than the line containing the invalid byte.

Unfortunately, I don't see any easy way to improve this behaviour, but I am not a Xerces developer.

> Wrong line/column number in UTFDataFormatException
> --------------------------------------------------
>
>          Key: XERCESC-1288
>          URL: http://issues.apache.org/jira/browse/XERCESC-1288
>      Project: Xerces-C++
>   Issue Type: Bug
>   Components: SAX/SAX2, Non-Validating Parser, DOM
> Affects Versions: 2.6.0, 2.5.0
>  Environment: Linux (SUSE 9.1, Fedora Core 2, Red Hat 9) on Intel,
>               Solaris 7 on SPARC, various gcc versions.
>     Reporter: Valerio Gionco
>     Priority: Minor
>
> I have the following (bad) XML file:
> --------------- bad.xml ----------------------------
> <?xml version="1.0" encoding="UTF-8"?>
> <block>
> <field>Blah blah</field>
> <field>Blah blah ò blah blah</field>
> <field>Blah blah</field>
> </block>
> ----------------------------------------------------
> (note the accented 'o' in the 2nd "field" line - hope it won't be
> destroyed...)
> The file is bad because the accented 'o' is represented with a single
> byte, 0xf2. This is the hex dump:
> 3e 42 6c 61 68 20 62 6c 61 68 20 f2 20 62 6c 61 |>Blah blah . bla|
> Problem is, when I run "SAXPrint bad.xml" I get the following error:
> Fatal Error at file /users/valerio/tmp/bad.xml, line 1, char 39
> Message: An exception occurred! Type:UTFDataFormatException,
> Message:invalid byte 2 ( ) of a 4-byte sequence.
> The row and column reported by SAXParseException::getColumnNumber()
> and SAXParseException::getLineNumber() are wrong. I seem to recall
> this was not the case with older (2.0 or 2.2?) versions of Xerces-C,
> but I'm not sure.
> I noticed the issue with 2.5, then tried with 2.6 but there was
> no apparent difference. Can somebody take care of this? We often
> have big XML files to parse, and not knowing where the error
> really is is a real pain.
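
Until the parser reports the right position, one possible workaround is to locate the offending byte outside of Xerces before (or after) parsing. The following is only a minimal sketch under that assumption: findFirstInvalidUtf8 and Utf8Check are hypothetical names, not part of the Xerces-C++ API, and the check is purely structural (lead byte plus continuation bytes), so it does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>

// Result of the scan. Hypothetical type, not a Xerces-C++ class.
struct Utf8Check {
    bool valid;          // true if no malformed sequence was found
    std::size_t offset;  // 0-based byte offset of the bad sequence's first byte
    std::size_t line;    // 1-based line of that byte
    std::size_t column;  // 1-based column, counted in characters
};

// Hypothetical helper: returns the position of the first structurally
// malformed UTF-8 sequence in `data`, or valid == true if there is none.
Utf8Check findFirstInvalidUtf8(const std::string& data) {
    std::size_t line = 1;
    std::size_t column = 1;
    std::size_t i = 0;

    while (i < data.size()) {
        const unsigned char lead = static_cast<unsigned char>(data[i]);
        std::size_t len = 0;

        if (lead < 0x80)                       len = 1;  // ASCII
        else if (lead >= 0xC2 && lead <= 0xDF) len = 2;
        else if (lead >= 0xE0 && lead <= 0xEF) len = 3;
        else if (lead >= 0xF0 && lead <= 0xF4) len = 4;
        else return Utf8Check{false, i, line, column};   // stray continuation or illegal lead byte

        // Sequence runs past the end of the input.
        if (i + len > data.size())
            return Utf8Check{false, i, line, column};

        // Every following byte must look like 10xxxxxx.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            if ((c & 0xC0) != 0x80)
                return Utf8Check{false, i, line, column};
        }

        // Advance the position by one character.
        if (lead == '\n') {
            ++line;
            column = 1;
        } else {
            ++column;
        }
        i += len;
    }
    return Utf8Check{true, data.size(), line, column};
}
```

For a file like the bad.xml above, where 0xf2 announces a 4-byte sequence but is followed by a space, this kind of scan would point at line 4 instead of line 1. It needs the whole file in memory (or an equivalent streaming variant), which may or may not be acceptable for big inputs.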
