[ https://issues.apache.org/jira/browse/XERCESJ-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100713#comment-17100713 ]
Craig Berry commented on XERCESJ-1668:
--------------------------------------

I encountered the same problem described here while trying to load a file into the eXist XML database, which uses Xerces-J to validate input. With the help of Adam Retter of the eXist project, after [I reported it to them|https://github.com/eXist-db/exist/issues/3400], I was able to reproduce it by just running the Counter program in the Xerces-J samples directory:

{code:java}
$ java -classpath samples/:./build/xercesImpl.jar sax.Counter A35965.xml
[Fatal Error] A35965.xml:29146:16: Invalid byte 2 of 4-byte UTF-8 sequence.
{code}

I then applied the surrogate.patch attached to this ticket; rebuilding and rerunning shows that the patch fixes it:

{code:java}
$ patch -p0 -i surrogate.patch
patching file src/org/apache/xerces/impl/io/UTF8Reader.java
patching file tests/io/UTF8ReaderTests.java
$ ant jar
...
$ java -classpath samples/:./build/xercesImpl.jar sax.Counter A35965.xml
A35965.xml: 107 ms (71396 elems, 198504 attrs, 0 spaces, 776615 chars)
{code}

I attempted to attach the XML file here, but the attachment was ignored (maybe too big?). It is available in the [GitHub eXist ticket|https://github.com/eXist-db/exist/files/4554916/A35965.zip].

It would be nice to see the patch, which has been sitting here for 4 1/2 years, get applied and released.

> Off-by-one bug w/ surrogates in UTF8Reader
> ------------------------------------------
>
>                 Key: XERCESJ-1668
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1668
>             Project: Xerces2-J
>          Issue Type: Bug
>          Components: Other
>            Reporter: Jan Berkel
>            Priority: Major
>         Attachments: surrogate.patch
>
>
> There's a bug in the surrogate handling when the reader buffer is exhausted
> and only the high-part can be written. On the next run the low-part gets
> added but the buffer space calculation is off by one.
> This gets triggered when parsing the current [enwiktionary dump
> file|http://dumps.wikimedia.org/enwiktionary/20151102/enwiktionary-20151102-pages-articles.xml.bz2].
> {noformat}
> org.xml.sax.SAXParseException; lineNumber: 99849520; columnNumber: 47;
> Invalid byte 2 of 4-byte UTF-8 sequence.
> {noformat}
> In the attached patch I added a fix + testcase for this bug. Another related
> issue is that when the low-part is written as last part of the stream -1 is
> returned instead of 1.
> Is UTF8Reader still necessary? It might be safer to just use a plain
> InputStreamReader.
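
For readers unfamiliar with the failure mode in the quoted description: a code point outside the BMP takes 4 bytes in UTF-8 but two Java chars (a high and a low surrogate), so a reader filling a char buffer can run out of room between the two halves. The sketch below is only an illustration of that boundary condition and of the "plain InputStreamReader" alternative raised in the description. It uses the JDK's InputStreamReader, not Xerces' UTF8Reader, and the class name SurrogateSplitDemo and its decode helper are hypothetical names that do not appear in surrogate.patch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Xerces-J or the attached patch.
class SurrogateSplitDemo {

    // Decodes UTF-8 bytes with the JDK's InputStreamReader, draining it
    // through a caller-side char buffer of the given size (must be >= 1).
    static String decode(byte[] utf8, int bufferSize) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[bufferSize];
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is outside the BMP: 4 bytes in UTF-8, 2 chars in UTF-16.
        String supplementary = new String(Character.toChars(0x1F600));
        String[] samples = {
            "ab" + supplementary + "c", // pair in the middle of the stream
            "ab" + supplementary        // low surrogate is the last char
        };
        for (String text : samples) {
            byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
            // Small buffer sizes force the pair to meet a buffer boundary.
            for (int size = 1; size <= 4; size++) {
                boolean ok = text.equals(decode(utf8, size));
                System.out.println(utf8.length + " bytes, buffer size "
                        + size + ", round-trips: " + ok);
            }
        }
    }
}
```

The second sample ends on the low surrogate, which corresponds to the end-of-stream case mentioned in the description. This says nothing about whether swapping readers is acceptable for Xerces itself; it only shows the input shape a regression test would need to cover.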