[ 
https://issues.apache.org/jira/browse/XERCESJ-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697167#action_12697167
 ] 

Michael McCandless commented on XERCESJ-1257:
---------------------------------------------

Over on the Apache Lucene project, we are also hitting this issue 
(https://issues.apache.org/jira/browse/LUCENE-1591).  We use Wikipedia's XML 
export for scalability testing.  I'm seeing this issue on the 20090306 release 
(http://download.wikimedia.org/enwiki/20090306/enwiki-20090306-pages-articles.xml.bz2).

I'm using Xerces2 2.9.1, and I'm doing the suggested workaround (wrapping with 
a java.io.InputStreamReader) yet I still hit the MalformedByteSequenceException.
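
For reference, a minimal sketch of the Reader-based workaround mentioned above, 
assuming the parser is obtained through JAXP (which may resolve to the JDK's 
bundled parser rather than Xerces2 2.9.1, depending on the classpath). The 
class and method names are hypothetical. Decoding happens in 
java.io.InputStreamReader, so the parser receives chars and should not need 
its own UTF-8 byte reader:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Xerces or Lucene.
class ReaderWorkaround {
    /** Parses UTF-8 XML through a Reader and returns the concatenated character data. */
    static String parseText(InputStream in) throws Exception {
        // Decode bytes to chars outside the parser.
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // An InputSource built from a character stream is read as chars by the parser.
        SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(reader), handler);
        return text.toString();
    }
}
```

Calling ReaderWorkaround.parseText(stream) on a document containing a 
character outside the BMP exercises the same path as the Wikipedia export, 
only on a much smaller input.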

It seems that Wikipedia's XML export is a good test case for Xerces2.

Is there any other suggested workaround here?

> buffer overflow in UTF8Reader for characters out of BMP
> -------------------------------------------------------
>
>                 Key: XERCESJ-1257
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1257
>             Project: Xerces2-J
>          Issue Type: Bug
>          Components: JAXP (javax.xml.parsers)
>    Affects Versions: 2.9.0
>         Environment: Any
>            Reporter: Robert Stojnic
>            Assignee: Michael Glavassevich
>            Priority: Minor
>         Attachments: TestXerces.java, UTF8Reader.patch
>
>
> There is an ArrayIndexOutOfBoundsException in org.apache.xerces.impl.io.UTF8Reader, 
> in read(char[],int,int) for 4-byte UTF-8 chars.
> Imagine the following scenario. read() has a buffer of size N, and it reads N-1 
> ASCII chars and stores them in the output buffer. Let the Nth char be the 
> first byte of a 4-byte UTF-8 char. The other 3 bytes are fetched by invoking 
> read() on the input stream. From these a surrogate pair of Java chars is 
> made; however, the method does not check whether both chars fit into the 
> output buffer. In most cases they would fit (e.g. if there are some other 
> multi-byte chars in the fetched text), so the bug is very rare, but it still 
> happens.
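
The scenario in the description can be illustrated with a small sketch. This 
is not the Xerces UTF8Reader code and not the attached UTF8Reader.patch; the 
class SurrogateSafeReader and its pendingLow field are hypothetical, and the 
decoder handles only ASCII and 4-byte sequences, with no validation of 
continuation bytes. It shows one way to guard the write: when only one slot 
is left in the output buffer, store the high surrogate and hold the low 
surrogate for the next call.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration: ASCII and 4-byte UTF-8 sequences only.
class SurrogateSafeReader {
    private final InputStream in;
    private int pendingLow = -1; // low surrogate that did not fit in the previous call

    SurrogateSafeReader(InputStream in) {
        this.in = in;
    }

    /** Fills ch[offset, offset + length) and returns the char count, or -1 at end of input. */
    int read(char[] ch, int offset, int length) throws IOException {
        if (length == 0) {
            return 0;
        }
        int out = offset;
        int end = offset + length;
        // Emit a surrogate left over from the previous call first.
        if (pendingLow != -1) {
            ch[out++] = (char) pendingLow;
            pendingLow = -1;
        }
        while (out < end) {
            int b0 = in.read();
            if (b0 == -1) {
                break;
            }
            if (b0 < 0x80) {
                ch[out++] = (char) b0;
            } else if ((b0 & 0xF8) == 0xF0) {
                int b1 = in.read();
                int b2 = in.read();
                int b3 = in.read();
                if (b3 == -1) {
                    throw new IOException("truncated 4-byte sequence");
                }
                int cp = ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12)
                        | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
                ch[out++] = Character.highSurrogate(cp);
                char low = Character.lowSurrogate(cp);
                // The check the description says is missing: is there room for the second char?
                if (out < end) {
                    ch[out++] = low;
                } else {
                    pendingLow = low;
                }
            } else {
                throw new IOException("byte not handled by this sketch: " + b0);
            }
        }
        return (out == offset) ? -1 : out - offset;
    }
}
```

With a 3-char buffer and the input "ab" followed by one 4-byte character, the 
first call fills the buffer with 'a', 'b' and the high surrogate, and the 
second call returns the low surrogate instead of writing past the end.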


