[
https://issues.apache.org/jira/browse/PDFBOX-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler closed PDFBOX-344.
-------------------------------------
> PushbackInputStream returns partial strings
> -------------------------------------------
>
> Key: PDFBOX-344
> URL: https://issues.apache.org/jira/browse/PDFBOX-344
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 0.7.3
> Environment: Mac OS X 10.5
> Reporter: John F. Walsh
> Fix For: 0.8.0-incubator
>
> Original Estimate: 0h
> Remaining Estimate: 0h
>
> When org.pdfbox.pdfparser.BaseParser.parseDirObject() checks whether it is
> reading the string "false" from pdfSource, that check can fail if there's a
> pause in the underlying read of the PDF file.
>
> org.pdfbox.io.PushBackInputStream extends java.io.PushbackInputStream.
> java.io.PushbackInputStream.read(byte[] b, int off, int len) can return
> fewer bytes than requested, e.g. "fals" instead of "false", if there's a
> pause in the read of the PDF file being processed. (The PDF file that caused
> this problem can't be shared because it contains customer data.)
>
> The solution is to retry the read until either enough bytes have been read
> or an EOF has been reached, in which case the bytes read so far should be
> returned. Adding the function override, below, to
> org.pdfbox.io.PushBackInputStream fixes the problem.
>
> I rated this bug Major because, though it's a show stopper when it happens, I
> suspect it's quite rare. But, in a production system, it matters.
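
The short read described above can be reproduced without the customer file. The following is a minimal sketch, not PDFBox code: the class name ShortReadDemo is hypothetical, and a java.io.SequenceInputStream over two in-memory pieces stands in for a file read that pauses after "fals".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
class ShortReadDemo {

    // Requests 5 bytes once and returns whatever that single call delivered.
    static String readOnce() {
        // Assumption: the two pieces simulate a pause in the underlying read.
        InputStream source = new SequenceInputStream(
                new ByteArrayInputStream("fals".getBytes(StandardCharsets.ISO_8859_1)),
                new ByteArrayInputStream("e".getBytes(StandardCharsets.ISO_8859_1)));
        try (PushbackInputStream in = new PushbackInputStream(source, 16)) {
            byte[] buf = new byte[5];
            int n = in.read(buf, 0, buf.length); // may legally return fewer than 5
            return new String(buf, 0, n, StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readOnce()); // prints "fals"
    }
}
```

A caller that compares the result against "false" after a single read call would see a mismatch here, which is the failure mode reported for parseDirObject().
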
> -------------------------------------
> /**
>  * Reads up to <code>len</code> bytes of data from this input stream into
>  * an array of bytes. This method first reads any pushed-back bytes; after
>  * that, if fewer than <code>len</code> bytes have been read then it
>  * reads from the underlying input stream. This method blocks until the
>  * requested number of bytes have been read, or until the end of the stream
>  * has been reached, in which case it returns the number of bytes actually
>  * read, or -1 if zero bytes were read.
>  *
>  * This overridden function enables <tt>org.pdfbox.pdfparser.BaseParser</tt>
>  * to be assured that it has the entire string it's checking for (typically
>  * "true" or "false") instead of returning a part of the string due to a
>  * pause in the underlying stream read.
>  *
>  * @param b the buffer into which the data is read.
>  * @param off the start offset of the data.
>  * @param len the maximum number of bytes read.
>  * @return the total number of bytes read into the buffer, or
>  *         <code>-1</code> if there is no more data because the end of
>  *         the stream has been reached.
>  * @exception IOException if an I/O error occurs.
>  * @see java.io.PushbackInputStream#read(byte[], int, int)
>  */
> public int read(byte[] b, int off, int len) throws IOException {
>     int bytesRead = super.read(b, off, len);
>     /* if we received the expected number of bytes, or an EOF, return
>        what we got: */
>     if ((bytesRead == len) || (bytesRead == -1)) {
>         return bytesRead;
>     }
>
>     int byteRead = 0;
>     while (bytesRead < len) {
>         /* if we're missing some bytes, read them one at a time
>            until we have the required number or an EOF is read. */
>         byteRead = super.read();
>         if (byteRead == -1) {
>             /* If it's an EOF, return what we got and report the EOF
>                on the next read: */
>             return bytesRead;
>         }
>         /* Add the byte to the array, relative to the start offset,
>            and loop. */
>         b[off + bytesRead] = (byte) byteRead;
>         bytesRead++;
>     }
>     /* Report the full read complete: */
>     return bytesRead;
> }
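
For comparison, here is a hedged sketch of the same idea that retries with bulk reads instead of single bytes and writes relative to off. It is an illustration only and not necessarily the change made for 0.8.0-incubator; the names FullReadDemo and FullReadPushbackInputStream are hypothetical, and the split SequenceInputStream again simulates the paused read.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
class FullReadDemo {

    // Hypothetical subclass: keeps reading until len bytes arrive or EOF is hit.
    static class FullReadPushbackInputStream extends PushbackInputStream {
        FullReadPushbackInputStream(InputStream in, int size) {
            super(in, size);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            int total = 0;
            while (total < len) {
                // Ask only for the bytes still missing, placed after those already read.
                int n = super.read(b, off + total, len - total);
                if (n == -1) {
                    // EOF: report -1 only if nothing at all was read.
                    return total == 0 ? -1 : total;
                }
                total += n;
            }
            return total;
        }
    }

    // Reads 5 bytes into a buffer starting at the given offset.
    static String readAtOffset(int off) {
        InputStream source = new SequenceInputStream(
                new ByteArrayInputStream("fals".getBytes(StandardCharsets.ISO_8859_1)),
                new ByteArrayInputStream("e".getBytes(StandardCharsets.ISO_8859_1)));
        try (PushbackInputStream in = new FullReadPushbackInputStream(source, 16)) {
            byte[] buf = new byte[off + 5];
            int n = in.read(buf, off, 5);
            return new String(buf, off, n, StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAtOffset(0)); // prints "false"
        System.out.println(readAtOffset(2)); // prints "false"
    }
}
```

Reading in bulk avoids one call per missing byte, and a nonzero offset is exercised to check that bytes land at off + total rather than at the start of the buffer.
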