Please don't delete comments if they are referenced by follow-ups.
BR
Andreas Lehmkühler

On 16.12.2013 17:09, Tilman Hausherr (JIRA) wrote:
[ https://issues.apache.org/jira/browse/PDFBOX-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-1769:
------------------------------------
    Comment: was deleted

(was: There's another problem in that parser: I get this exception with the file amyuni2_05d__pdf1_3_acro4x.pdf (it was once part of the project but no longer is; it can still be found on the web):

java.io.IOException: Object (48:0) at offset 161333 does not end with 'endobj'.
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1312)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1159)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseDictObjects(NonSequentialPDFParser.java:1133)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:470)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:731)
    at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1139)
    at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1122)
    at pdfboxpageimageextraction.ExtractImages.doPdf(ExtractImages.java:134)
    at pdfboxpageimageextraction.ExtractImages.main(ExtractImages.java:78)

The exception is correct: the "endobj" is indeed missing in that file. However, the content of endObjectKey is "49 0 obj", i.e. the start of a new object.
So my suggestion is to change this segment

{code}
if (!endObjectKey.startsWith("endobj"))
{
    throw new IOException("Object (" + readObjNr + ":" + readObjGen + ") at offset "
            + offsetOrObjstmObNr + " does not end with 'endobj'.");
}
{code}

to

{code}
if (!endObjectKey.startsWith("endobj"))
{
    if (endObjectKey.endsWith(" obj"))
        LOG.warn("Object (" + readObjNr + ":" + readObjGen + ") at offset "
                + offsetOrObjstmObNr + " does not end with 'endobj' but with '" + endObjectKey + "'");
    else
        throw new IOException("Object (" + readObjNr + ":" + readObjGen + ") at offset "
                + offsetOrObjstmObNr + " does not end with 'endobj' but with '" + endObjectKey + "'");
}
{code})

Fix crash on invalid xref
-------------------------

Key: PDFBOX-1769
URL: https://issues.apache.org/jira/browse/PDFBOX-1769
Project: PDFBox
Issue Type: Wish
Components: Parsing
Affects Versions: 1.8.2
Reporter: William Palmer
Assignee: Andreas Lehmkühler
Fix For: 1.8.4, 2.0.0

Need to search for a correct xref start address.

Example file: http://digitalcorpora.org/corp/nps/files/govdocs1/020/020747.pdf

Exception in thread "main" java.io.IOException: Error: Expected an integer type, actual='ref'
    at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1622)

Using the code:

PDFTextStripper ts = new PDFTextStripper();
PrintWriter out = new PrintWriter(new FileWriter(new File(pFile + ".txt")));
RandomAccess scratchFile = new RandomAccessFile(File.createTempFile("pdfbox-", ".tmp"), "rw");
PDDocument doc = PDDocument.loadNonSeq(new File(pFile), scratchFile);
ts.setForceParsing(true);
ts.writeText(doc, out);

Related: PDFBOX-1757

--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
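
For reference, the lenient 'endobj' check proposed in the deleted comment can be tried outside the parser. The following is only a minimal standalone sketch, not the PDFBox code: the class name EndObjCheck, the method checkEndOfObject, its boolean return value, and the use of java.util.logging in place of the parser's own LOG are all assumptions made for illustration.

```java
import java.io.IOException;
import java.util.logging.Logger;

// Hypothetical standalone helper; it does not exist in PDFBox.
class EndObjCheck {
    private static final Logger LOG = Logger.getLogger(EndObjCheck.class.getName());

    /**
     * Mirrors the suggested check.
     * Returns true if the token read after the object is 'endobj'.
     * Returns false (and logs a warning) if 'endobj' is missing but the token
     * looks like the header of the next object, e.g. "49 0 obj".
     * Throws IOException for any other token.
     */
    public static boolean checkEndOfObject(String endObjectKey, long objNr, int objGen, long offset)
            throws IOException {
        if (endObjectKey.startsWith("endobj")) {
            return true;
        }
        String message = "Object (" + objNr + ":" + objGen + ") at offset " + offset
                + " does not end with 'endobj' but with '" + endObjectKey + "'";
        if (endObjectKey.endsWith(" obj")) {
            // Tolerated: the terminator is missing, but the next object starts here.
            LOG.warning(message);
            return false;
        }
        throw new IOException(message);
    }

    public static void main(String[] args) throws IOException {
        // Values taken from the stack trace in the comment above.
        System.out.println(checkEndOfObject("endobj", 48, 0, 161333L));
        System.out.println(checkEndOfObject("49 0 obj", 48, 0, 161333L));
    }
}
```

Returning a boolean instead of nothing is a choice made for this sketch only, so that a caller could tell a clean end from a tolerated one; the suggested patch itself just logs and continues.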
