[jira] [Commented] (XERCESJ-1668) Off-by-one bug w/ surrogates in UTF8Reader

Craig Berry (Jira) Wed, 06 May 2020 04:43:43 -0700


    [ 
https://issues.apache.org/jira/browse/XERCESJ-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100713#comment-17100713
 ]


Craig Berry commented on XERCESJ-1668:
--------------------------------------

I encountered the problem described here while trying to load a file into the 
eXist XML database, which uses Xerces-J to validate input.  After [I reported 
it to the eXist project|https://github.com/eXist-db/exist/issues/3400], Adam 
Retter helped me reproduce it using nothing more than the Counter program in 
the Xerces-J samples directory:

 
{code:java}
$ java -classpath samples/:./build/xercesImpl.jar sax.Counter A35965.xml
[Fatal Error] A35965.xml:29146:16: Invalid byte 2 of 4-byte UTF-8 
sequence.{code}
 

I then applied the surrogate.patch attachment from this ticket; rebuilding and 
rerunning shows that the patch fixes it:

 
{code:java}
$ patch -p0 -i surrogate.patch
patching file src/org/apache/xerces/impl/io/UTF8Reader.java
patching file tests/io/UTF8ReaderTests.java
$ ant jar
...
$ java -classpath samples/:./build/xercesImpl.jar sax.Counter A35965.xml
A35965.xml: 107 ms (71396 elems, 198504 attrs, 0 spaces, 776615 chars){code}
 

I attempted to attach the XML file here, but it was ignored (maybe too big?). 
It's available in the [GitHub eXist 
ticket|https://github.com/eXist-db/exist/files/4554916/A35965.zip].

 

It would be nice to see the patch, which has been sitting here for 4 1/2 years, 
get applied and released.

> Off-by-one bug w/ surrogates in UTF8Reader
> ------------------------------------------
>
>                 Key: XERCESJ-1668
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1668
>             Project: Xerces2-J
>          Issue Type: Bug
>          Components: Other
>            Reporter: Jan Berkel
>            Priority: Major
>         Attachments: surrogate.patch
>
>
> There's a bug in the surrogate handling when the reader buffer is exhausted 
> and only the high-part can be written. On the next run the low-part gets 
> added but the buffer space calculation is off by one.
> This gets triggered when parsing the current [enwiktionary dump 
> file|http://dumps.wikimedia.org/enwiktionary/20151102/enwiktionary-20151102-pages-articles.xml.bz2].
> {noformat}
> org.xml.sax.SAXParseException; lineNumber: 99849520; columnNumber: 47; 
> Invalid byte 2 of 4-byte UTF-8 sequence.
> {noformat}
> In the attached patch I added a fix + testcase for this bug. Another related 
> issue is that when the low-part is written as last part of the stream -1 is 
> returned instead of 1.
> Is UTF8Reader still necessary? It might be safer to just use a plain 
> InputStreamReader.
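
To illustrate the situation the description refers to, here is a minimal 
sketch that uses the JDK's plain InputStreamReader rather than Xerces' 
UTF8Reader. The class and method names (SurrogateSplitDemo, decode) are 
made up for this example and are not part of Xerces or the attached patch. 
A code point above U+FFFF takes 4 bytes in UTF-8 but 2 chars (a high and a 
low surrogate) in Java, so a caller that asks for only one char at a time 
forces the reader to hand out the two halves in separate calls:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class SurrogateSplitDemo {

    // Decodes UTF-8 bytes through a plain InputStreamReader, requesting at
    // most bufSize chars per read() call, and returns everything decoded.
    public static String decode(byte[] utf8, int bufSize) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[bufSize];
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // 'a' followed by U+1F600: one 1-byte and one 4-byte UTF-8 sequence.
        String text = "a" + new String(Character.toChars(0x1F600));
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // With a 1-char buffer the surrogate pair cannot fit in one read,
        // so its two halves must be delivered by consecutive calls.
        String oneAtATime = decode(utf8, 1);
        System.out.println("round trip intact: " + oneAtATime.equals(text));
    }
}
```

This only shows the boundary condition with the JDK reader; it does not 
reproduce the UTF8Reader failure itself, which needs the unpatched Xerces 
class and an input like the files linked above.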



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: j-dev-unsubscr...@xerces.apache.org
For additional commands, e-mail: j-dev-h...@xerces.apache.org
