Seva Alekseyev created TIKA-2140:
------------------------------------
Summary: ClassCastException on a valid PDF
Key: TIKA-2140
URL: https://issues.apache.org/jira/browse/TIKA-2140
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.13
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
On the following PDF file, which opens fine in Adobe Reader:
https://dl.dropboxusercontent.com/u/92341073/FDA%20Submission%2096%20Vol.%20III.pdf
the Tika parser throws the following error:
java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSDictionary
at org.apache.pdfbox.pdmodel.PDPageTree.getKids(PDPageTree.java:144)
at org.apache.pdfbox.pdmodel.PDPageTree.access$200(PDPageTree.java:38)
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:166)
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:159)
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:153)
at org.apache.pdfbox.pdmodel.PDPageTree.iterator(PDPageTree.java:123)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:314)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:160)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:144)
Before that, PDFBox throws some warnings:
21 Oct 2016 11:46:35 WARN BaseParser - Invalid dictionary, found: '?' but expected: '/' at offset 22061056
21 Oct 2016 11:46:36 WARN BaseParser - Invalid dictionary, found: '?' but expected: '/' at offset 22061056
21 Oct 2016 11:46:36 WARN COSParser - Object (3:0) at offset 22059324 does not end with 'endobj' but with ''
So the file is somewhat malformed, but not to the point of unreadability.
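To illustrate the kind of failure the trace points at, here is a minimal sketch, not PDFBox code. It assumes the page tree's child list ends up holding an integer where a dictionary is expected, which is what the exception message suggests. All type and method names below (PdfObject, PdfInteger, PdfDictionary, kidsUnchecked, kidsLenient) are hypothetical stand-ins made up for this example. It contrasts an unchecked cast with an instanceof guard that skips unexpected entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KidsCastSketch {
    // Hypothetical stand-ins for PDF object types; these are NOT the PDFBox classes.
    public interface PdfObject {}

    public static final class PdfInteger implements PdfObject {
        public final long value;
        public PdfInteger(long value) { this.value = value; }
    }

    public static final class PdfDictionary implements PdfObject {
        public final String name;
        public PdfDictionary(String name) { this.name = name; }
    }

    // Unchecked variant: casts every entry, so a stray integer
    // in the list causes a ClassCastException.
    public static List<PdfDictionary> kidsUnchecked(List<PdfObject> kids) {
        List<PdfDictionary> result = new ArrayList<>();
        for (PdfObject kid : kids) {
            result.add((PdfDictionary) kid);
        }
        return result;
    }

    // Lenient variant: keeps dictionaries and skips anything else,
    // so a partially malformed list still yields the readable entries.
    public static List<PdfDictionary> kidsLenient(List<PdfObject> kids) {
        List<PdfDictionary> result = new ArrayList<>();
        for (PdfObject kid : kids) {
            if (kid instanceof PdfDictionary) {
                result.add((PdfDictionary) kid);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A child list with one unexpected integer between two valid entries.
        List<PdfObject> kids = Arrays.asList(
                new PdfDictionary("page1"),
                new PdfInteger(3),
                new PdfDictionary("page2"));

        System.out.println("lenient kids: " + kidsLenient(kids).size());
        try {
            kidsUnchecked(kids);
        } catch (ClassCastException e) {
            System.out.println("unchecked cast failed: " + e.getClass().getName());
        }
    }
}
```

Whether skipping the bad entry is the right behaviour for the real parser is a separate question. The sketch only shows that the readable entries can be recovered when one entry has the wrong type.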