[ 
http://issues.apache.org/jira/browse/XERCESC-1288?page=comments#action_12423637 
] 
            
HAUTION Philippe commented on XERCESC-1288:
-------------------------------------------

I ran into the same annoying bug with Xerces-C++ 2.7.0.
I looked into the code and saw that the XMLScanner::scanProlog() method 
calls ReaderMgr::peekNextChar(), which in fact fills an internal buffer 
through XMLReader::refreshCharBuffer(), which in turn tries to transcode the 
raw bytes with XMLUTF8Transcoder::transcodeFrom(). When there is an invalid 
character, a UTFDataFormatException is thrown from this method, without any 
information about a line or column number. 
This exception is caught in IGXMLScanner::scanDocument() and converted into a 
SAXParseException with the current line and column number, i.e. the first line 
that was correctly scanned, not the line that contains the bad byte.
Unfortunately I don't see any easy way to improve this behaviour, but I am not 
a Xerces developer.
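
One possible workaround that does not touch the parser is to pre-scan the raw 
bytes before handing the file to Xerces and report the line and column of the 
first malformed UTF-8 sequence. The sketch below is only an illustration: 
Utf8Error and findFirstInvalidUtf8 are names made up here and are not part of 
Xerces-C++. It assumes the document is meant to be UTF-8 with '\n' line 
endings, and it only checks structure (a lead byte followed by the right 
number of continuation bytes), so it is looser than the real transcoder.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper type, not a Xerces-C++ class.
// Line and column are 1-based; column counts characters, not bytes.
struct Utf8Error {
    bool found;
    std::size_t line;
    std::size_t column;
    std::size_t offset;  // byte offset from the start of the buffer
};

// Total length of a sequence that starts with lead byte b,
// or 0 if b cannot start a UTF-8 sequence.
static std::size_t utf8SequenceLength(unsigned char b) {
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

// Walk the buffer once and stop at the first structurally invalid sequence.
Utf8Error findFirstInvalidUtf8(const std::string& data) {
    std::size_t line = 1, column = 1, i = 0;
    while (i < data.size()) {
        const unsigned char lead = static_cast<unsigned char>(data[i]);
        const std::size_t len = utf8SequenceLength(lead);
        bool ok = (len != 0) && (i + len <= data.size());
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            ok = (c & 0xC0) == 0x80;  // continuation bytes look like 10xxxxxx
        }
        if (!ok) return {true, line, column, i};
        if (lead == '\n') { ++line; column = 1; } else { ++column; }
        i += len;
    }
    return {false, line, column, i};
}
```

Run on the contents of the bad.xml from the report, this should point at line 
4, where the 0xf2 byte sits, rather than at line 1.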

> Wrong line/column number in UTFDataFormatException
> --------------------------------------------------
>
>                 Key: XERCESC-1288
>                 URL: http://issues.apache.org/jira/browse/XERCESC-1288
>             Project: Xerces-C++
>          Issue Type: Bug
>          Components: SAX/SAX2, Non-Validating Parser, DOM
>    Affects Versions: 2.6.0, 2.5.0
>         Environment: Linux (SUSE 9.1, Fedora core 2, Redhat 9) on Intel, 
> Solaris 7 on SPARC,  various gcc versions.
>            Reporter: Valerio Gionco
>            Priority: Minor
>
> I've the following (bad) XML file:
> --------------- bad.xml ----------------------------
> <?xml version="1.0" encoding="UTF-8"?>
> <block>
>         <field>Blah blah</field>
>         <field>Blah blah ò blah blah</field>
>         <field>Blah blah</field>
> </block>
> ----------------------------------------------------
> (note the accented 'o' in the 2nd "field" line - hope it won't be
> destroyed...)
> The file is bad because the accented 'o' is represented with a single
> byte, 0xf2. This is the hex dump:
> 3e 42 6c 61 68 20 62 6c  61 68 20 f2 20 62 6c 61  |>Blah blah . bla|
> Problem is, when I run "SAXPrint bad.xml" I get the following error:
> Fatal Error at file /users/valerio/tmp/bad.xml, line 1, char 39
>   Message: An exception occurred! Type:UTFDataFormatException, 
> Message:invalid byte 2 ( ) of a 4-byte sequence.
> The row and column reported by SAXParseException::getColumnNumber()
> and SAXParseException::getLineNumber() are wrong. I seem to recall
> this was not the case with older (2.0 or 2.2?) versions of Xerces-C,
> but I'm not sure.
> I noticed the issue with 2.5, then tried with 2.6 but there was
> no apparent difference. Can somebody take care of this? We often
> have big XML files to parse, and not knowing where the error
> really is is a real pain.
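
For what it's worth, the wording of the message follows from the bit pattern 
of the byte: 0xf2 is 11110010, which in UTF-8 announces a 4-byte sequence, and 
the next byte in the dump (0x20, a space) is not a continuation byte. The 
sketch below shows that check, plus a re-encoding step under the assumption 
that the file was really saved as ISO-8859-1, where 0xf2 is the accented 'o' 
(the report does not say which encoding was used). The function names are 
illustrative only.

```cpp
#include <cassert>
#include <string>

// Sequence length announced by the top bits of a lead byte;
// 0 means the byte is a continuation byte or not a valid lead.
int expectedLength(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Continuation bytes have the form 10xxxxxx.
bool isContinuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Assumption: the input is ISO-8859-1. Each byte >= 0x80 maps to the
// Unicode code point of the same value and becomes a 2-byte UTF-8 sequence.
std::string latin1ToUtf8(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

If the ISO-8859-1 assumption holds, the other way out is to leave the bytes 
alone and change the declaration to encoding="ISO-8859-1".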



