Does the patch from PDFBOX-908 [1] fix this? I reviewed that patch a while ago but didn't have time to test it myself. I don't normally commit things without checking them myself, but if you can confirm that it works, I'll get it committed to the trunk.
[1] https://issues.apache.org/jira/browse/PDFBOX-908

----
Thanks,
Adam

From: "Timo Boehme (JIRA)" <[email protected]>
To: [email protected]
Date: 03/15/2011 02:37
Subject: [jira] Commented: (PDFBOX-979) errors in %%EOF handling (fix included)

    [ https://issues.apache.org/jira/browse/PDFBOX-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006855#comment-13006855 ]

Timo Boehme commented on PDFBOX-979:
------------------------------------

I have some bogus PDF files where content starts immediately after '%%EOF':

    startxref
    302041
    %%EOF333 0 obj<</Length 15/Root

In order to handle it like in the 'endobj' case I test if we start with '%%EOF' and unread all following content. New fixed version:

    String eof = "";
    if(!pdfSource.isEOF())
        eof = readLine(); // if there's more data to read, get the EOF flag

    // verify that EOF exists
    if(!"%%EOF".equals(eof)) {
        if( eof.startsWith( "%%EOF" ) ) {
            // content after marker -> unread with first space byte for read newline
            pdfSource.unread( SPACE_BYTE ); // we read a whole line; add space as newline replacement
            pdfSource.unread( eof.substring( 5 ).getBytes("ISO-8859-1") );
        } else {
            // PDF does not conform to spec, we should warn someone
            log.warn("expected='%%EOF' actual='" + eof + "'");
            // if we're not at the end of a file, just put it back and move on
            if(!pdfSource.isEOF()) {
                pdfSource.unread( SPACE_BYTE ); // we read a whole line; add space as newline replacement
                pdfSource.unread(eof.getBytes("ISO-8859-1"));
            }
        }
    }

> errors in %%EOF handling (fix included)
> ---------------------------------------
>
> Key: PDFBOX-979
> URL: https://issues.apache.org/jira/browse/PDFBOX-979
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.6.0
> Reporter: Timo Boehme
>
> The '%%EOF' handling in PDFParser has several errors.
> The current implementation (start from line 467):
>
>     String eof = "";
>     if(!pdfSource.isEOF())
>         readLine(); // if there's more data to read, get the EOF flag
>
>     // verify that EOF exists
>     if("%%EOF".equals(eof)) {
>         // PDF does not conform to spec, we should warn someone
>         log.warn("expected='%%EOF' actual='" + eof + "'");
>         // if we're not at the end of a file, just put it back and move on
>         if(!pdfSource.isEOF())
>             pdfSource.unread(eof.getBytes("ISO-8859-1"));
>     }
>
> The problems:
> - eof variable gets no value
> - comparison if("%%EOF".equals(eof)) must be negated
> - unreading must first add a newline or space byte because we read with readline() (like in bug PDFBOX-978)
>
> Corrected version:
>
>     String eof = "";
>     if(!pdfSource.isEOF())
>         eof = readLine(); // if there's more data to read, get the EOF flag
>
>     // verify that EOF exists
>     if(!"%%EOF".equals(eof)) {
>         // PDF does not conform to spec, we should warn someone
>         log.warn("expected='%%EOF' actual='" + eof + "'");
>         // if we're not at the end of a file, just put it back and move on
>         if(!pdfSource.isEOF()) {
>             pdfSource.unread( SPACE_BYTE ); // we read a whole line; add space as newline replacement
>             pdfSource.unread(eof.getBytes("ISO-8859-1"));
>         }
>     }
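
For anyone who wants to try the quoted logic outside of PDFParser, here is a minimal standalone sketch. It is not the PDFBox code: the class EofMarkerSketch, the helpers isEOF, readLine and check, and the boolean return value are all assumptions made for illustration, and a plain java.io.PushbackInputStream stands in for pdfSource. The warning that the real code logs is replaced by returning false.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Standalone sketch only; all names here are hypothetical, not PDFBox API.
class EofMarkerSketch {
    private static final String EOF_MARKER = "%%EOF";
    private static final int SPACE_BYTE = 32;

    // Stand-in for pdfSource.isEOF(): peek one byte and push it back.
    static boolean isEOF(PushbackInputStream in) throws IOException {
        int c = in.read();
        if (c == -1) {
            return true;
        }
        in.unread(c);
        return false;
    }

    // Stand-in for readLine(): returns the line and consumes its line break.
    static String readLine(PushbackInputStream in) throws IOException {
        StringBuilder line = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '\n' && c != '\r') {
            line.append((char) c);
        }
        if (c == '\r') {
            // treat CR LF as a single line break
            int next = in.read();
            if (next != -1 && next != '\n') {
                in.unread(next);
            }
        }
        return line.toString();
    }

    // Returns true if a '%%EOF' marker was found (an assumption of this sketch;
    // the quoted code logs a warning instead of returning a value).
    static boolean consumeEofMarker(PushbackInputStream in) throws IOException {
        String eof = isEOF(in) ? "" : readLine(in);
        if (EOF_MARKER.equals(eof)) {
            return true;
        }
        if (eof.startsWith(EOF_MARKER)) {
            // content follows the marker on the same line: push it back,
            // with a space standing in for the line break readLine() consumed
            in.unread(SPACE_BYTE);
            in.unread(eof.substring(EOF_MARKER.length())
                         .getBytes(StandardCharsets.ISO_8859_1));
            return true;
        }
        // no marker: push the whole line back unless the input is exhausted
        if (!isEOF(in)) {
            in.unread(SPACE_BYTE);
            in.unread(eof.getBytes(StandardCharsets.ISO_8859_1));
        }
        return false;
    }

    // Helper for experiments: "<markerFound>|<next line after handling>".
    static String check(String data) {
        try {
            PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(data.getBytes(StandardCharsets.ISO_8859_1)),
                1024); // pushback buffer must hold the longest unread line
            boolean found = consumeEofMarker(in);
            return found + "|" + readLine(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(check("%%EOF333 0 obj<</Length 15/Root\nendobj"));
        System.out.println(check("%%EOF\n"));
    }
}
```

Note the order of the two unread calls: the space is pushed back first so that, when the stream is read again, the leftover content comes out first and the space follows it, in the position of the original line break.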
