Jan Berkel created XERCESJ-1668:
-----------------------------------

             Summary: Off-by-one bug w/ surrogates in UTF8Reader
                 Key: XERCESJ-1668
                 URL: https://issues.apache.org/jira/browse/XERCESJ-1668
             Project: Xerces2-J
          Issue Type: Bug
          Components: Other
            Reporter: Jan Berkel

There's a bug in the surrogate handling of UTF8Reader when the reader buffer is exhausted and only the high surrogate can be written. On the next read the low surrogate gets added, but the calculation of the remaining buffer space is then off by one.

This gets triggered when parsing the current [enwiktionary dump file|http://dumps.wikimedia.org/enwiktionary/20151102/enwiktionary-20151102-pages-articles.xml.bz2].

{noformat}
org.xml.sax.SAXParseException; lineNumber: 99849520; columnNumber: 47; Invalid byte 2 of 4-byte UTF-8 sequence.
{noformat}

In the attached patch I added a test case for this bug.

Another related issue: when the low surrogate is written as the last part of the stream, -1 is returned instead of 1.
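
To illustrate the situation described above, here is a minimal sketch of a reader that holds back a low surrogate when the char buffer fills up. It is not the Xerces UTF8Reader code and not the attached patch: the names SurrogateSketch, TinyUtf8Reader and pendingLow are hypothetical, and the sketch only understands ASCII and 4-byte UTF-8 sequences. The two points it tries to show are that a held-back low surrogate must be counted against the buffer space on the next read, and that a read which delivers only that low surrogate should return 1 even if the stream is already at its end.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class SurrogateSketch {

    /** Hypothetical minimal reader: ASCII and 4-byte UTF-8 sequences only. */
    static final class TinyUtf8Reader {
        private final InputStream in;
        // Low surrogate that did not fit into the previous buffer, or -1.
        private int pendingLow = -1;

        TinyUtf8Reader(InputStream in) {
            this.in = in;
        }

        int read(char[] buf, int offset, int length) throws IOException {
            if (length <= 0) {
                return 0;
            }
            int out = offset;
            final int end = offset + length;

            // Emit the held-back low surrogate first. Advancing 'out' here is
            // what keeps the remaining-space calculation (end - out) correct.
            if (pendingLow != -1) {
                buf[out++] = (char) pendingLow;
                pendingLow = -1;
            }

            while (out < end) {
                int b0 = in.read();
                if (b0 == -1) {
                    break; // end of stream
                }
                if (b0 < 0x80) {
                    buf[out++] = (char) b0; // plain ASCII
                    continue;
                }
                if ((b0 & 0xF8) != 0xF0) {
                    throw new IOException("lead byte not supported by this sketch");
                }
                int b1 = in.read();
                int b2 = in.read();
                int b3 = in.read();
                if (b1 < 0 || b2 < 0 || b3 < 0) {
                    throw new IOException("truncated 4-byte UTF-8 sequence");
                }
                int cp = ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12)
                        | ((b2 & 0x3F) << 6) | (b3 & 0x3F);

                buf[out++] = Character.highSurrogate(cp);
                char low = Character.lowSurrogate(cp);
                if (out < end) {
                    buf[out++] = low;
                } else {
                    pendingLow = low; // no room left: deliver on the next read
                }
            }

            // Chars were written (possibly only the pending low surrogate),
            // so report the count; -1 only when nothing was written at all.
            int count = out - offset;
            return count == 0 ? -1 : count;
        }
    }

    public static void main(String[] args) throws IOException {
        // 'a' followed by U+1F600, which needs a surrogate pair in UTF-16.
        byte[] data = "a\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        TinyUtf8Reader reader = new TinyUtf8Reader(new ByteArrayInputStream(data));
        char[] buf = new char[2];
        System.out.println(reader.read(buf, 0, 2));
        System.out.println(reader.read(buf, 0, 2));
        System.out.println(reader.read(buf, 0, 2));
    }
}
```

With a two-char buffer, the first read fills the buffer with 'a' and the high surrogate, the second read delivers only the low surrogate, and the third reports end of stream.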