Seva Alekseyev created TIKA-2140:
------------------------------------

             Summary: ClassCastException on a valid PDF
                 Key: TIKA-2140
                 URL: https://issues.apache.org/jira/browse/TIKA-2140
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.13
         Environment: Windows 7 x64, JVM 1.8.0_101
            Reporter: Seva Alekseyev

On the following PDF file, which opens fine in Adobe Reader:

https://dl.dropboxusercontent.com/u/92341073/FDA%20Submission%2096%20Vol.%20III.pdf

the Tika parser throws the following error:

java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSDictionary
	at org.apache.pdfbox.pdmodel.PDPageTree.getKids(PDPageTree.java:144)
	at org.apache.pdfbox.pdmodel.PDPageTree.access$200(PDPageTree.java:38)
	at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:166)
	at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:159)
	at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:153)
	at org.apache.pdfbox.pdmodel.PDPageTree.iterator(PDPageTree.java:123)
	at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:314)
	at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:160)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:144)

Before that, PDFBox throws some warnings:

21 Oct 2016 11:46:35 WARN BaseParser - Invalid dictionary, found: '?' but expected: '/' at offset 22061056
21 Oct 2016 11:46:36 WARN BaseParser - Invalid dictionary, found: '?' but expected: '/' at offset 22061056
21 Oct 2016 11:46:36 WARN COSParser - Object (3:0) at offset 22059324 does not end with 'endobj' but with ''

So the file is somewhat malformed, but not to the point of unreadability.
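To illustrate the kind of failure the stack trace points at, here is a minimal, self-contained sketch. It is not PDFBox's code: `PageTreeCastSketch`, `kidsUnchecked` and `kidsLenient` are hypothetical names, and plain `Map` and `Integer` stand in for `COSDictionary` and `COSInteger`. It only assumes that a page tree's kids array can end up holding a non-dictionary entry when an object is damaged, and contrasts an unconditional cast with a type check that skips the bad entry.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PageTreeCastSketch {

    /** Casts every kid unconditionally; fails on the first non-dictionary entry. */
    static List<Map<String, Object>> kidsUnchecked(List<Object> kids) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Object kid : kids) {
            @SuppressWarnings("unchecked")
            Map<String, Object> dict = (Map<String, Object>) kid; // ClassCastException on Integer
            result.add(dict);
        }
        return result;
    }

    /** Checks the type first and skips entries that are not dictionaries. */
    static List<Map<String, Object>> kidsLenient(List<Object> kids) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Object kid : kids) {
            if (kid instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> dict = (Map<String, Object>) kid;
                result.add(dict);
            }
            // Anything else (here: an Integer) is treated as damage and ignored.
        }
        return result;
    }

    public static void main(String[] args) {
        // One well-formed "page" plus one stray integer, standing in for the damaged object.
        List<Object> kids = new ArrayList<>();
        kids.add(new HashMap<String, Object>());
        kids.add(Integer.valueOf(3));

        System.out.println("lenient kids: " + kidsLenient(kids).size());
        try {
            kidsUnchecked(kids);
        } catch (ClassCastException e) {
            System.out.println("unchecked cast failed: " + e.getMessage());
        }
    }
}
```

Whether skipping the bad entry is the right recovery for this particular file is a separate question; the sketch only shows why a single malformed object can abort an otherwise readable document.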