[ https://issues.apache.org/jira/browse/PDFBOX-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated PDFBOX-1299:
---------------------------------------

    Attachment: TX0819_2009-07-27_Windstream-TCG_Agreement.pdf

Test PDF showing the problem. I got the PDF from http://acrobatusers.com/gallery/pdf_portfolio_gallery, specifically http://acrobatusers.com/assets/uploads/gallery/Tracey_Prather_31-Dec-2010_211843_2011Portfolio.pdf

In this PDF, at offset=446726, we have a "4 0 obj" stream, with Length=368286. If you skip ahead by that length, the next object is "5 0 obj". But, unfortunately, within those bytes is an "endstream" on its own line, just before offset=714247 (this "belongs" to the embedded PDF), and that causes readUntilEndOfStream to stop too early, leading to this IOException when running ExtractText (on current trunk):

{noformat}
Exception in thread "main" java.io.IOException: Unknown dir object c=']' cInt=93 peek=']' peekInt=93 757109
	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1215)
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:216)
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:342)
	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1117)
	at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:557)
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1090)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1055)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:980)
	at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:196)
	at org.apache.pdfbox.ExtractText.main(ExtractText.java:76)
{noformat}

> BaseParser.readUntilEndOfStream can stop too early, causing IOException on valid PDFs
> -------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1299
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1299
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.6.0
>            Reporter: Michael McCandless
>         Attachments: TX0819_2009-07-27_Windstream-TCG_Agreement.pdf
>
>
> The purpose of BaseParser.readUntilEndOfStream is to scan ahead,
> copying bytes to the output, stopping once it sees "endstream".
> The problem with this approach is that sometimes the stream data itself
> contains "endstream", causing readUntilEndOfStream to stop too early.
> This can legitimately happen when the stream is an embedded PDF; I'll
> attach a test PDF showing this.
> However, the stream dict declares the stream length (in bytes)... so
> it seems like we should be respecting that length (if present) and
> simply copying over that many bytes, instead of scanning the stream bytes
> for "endstream"? This should be a lot faster too...
> I imagine we always scan so that we are more robust if the length is
> missing/invalid? Is that why this method was used? (I don't know the
> history here...). If so, maybe we can have an option to use
> the declared stream length if present.
> I have a patch to use the declared stream length (if present), and it enables
> at least this test PDF to correctly parse.
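
For illustration only, here is a minimal sketch of the idea described above: trust the declared Length when "endstream" really does follow that many bytes, and fall back to scanning otherwise. This is not the attached patch and not PDFBox code. The class StreamBodyReader, its methods, and the choice to work on an in-memory byte array are all assumptions made to keep the example self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox.
final class StreamBodyReader {
    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.ISO_8859_1);

    // Index of the first occurrence of pattern at or after 'from', or -1.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // True if "endstream" appears at pos after optional CR, LF, or space bytes.
    static boolean endstreamFollows(byte[] data, int pos) {
        while (pos < data.length
                && (data[pos] == '\r' || data[pos] == '\n' || data[pos] == ' ')) {
            pos++;
        }
        if (pos + ENDSTREAM.length > data.length) {
            return false;
        }
        for (int i = 0; i < ENDSTREAM.length; i++) {
            if (data[pos + i] != ENDSTREAM[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns the stream body starting at 'start'.
    // declaredLength < 0 means the Length entry is missing.
    static byte[] readStreamBody(byte[] data, int start, int declaredLength) {
        boolean lengthUsable = declaredLength >= 0
                && (long) start + declaredLength <= data.length
                && endstreamFollows(data, start + declaredLength);
        if (lengthUsable) {
            // Declared length checks out: copy exactly that many bytes,
            // even if they contain "endstream" (e.g. an embedded PDF).
            return Arrays.copyOfRange(data, start, start + declaredLength);
        }
        // Length missing or invalid: scan for the first "endstream".
        // (This sketch does not strip the EOL that precedes it.)
        int end = indexOf(data, ENDSTREAM, start);
        if (end < 0) {
            end = data.length;
        }
        return Arrays.copyOfRange(data, start, end);
    }
}
```

With a body that itself contains "endstream", passing the correct declared length returns the whole body, while passing -1 (or a wrong length) reproduces the early stop that the scanning approach suffers from. A real parser works on a stream rather than a byte array, so it would need pushback or buffering to check what follows the declared length.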