Seva Alekseyev created TIKA-2140:
------------------------------------

             Summary: ClassCastException on a valid PDF
                 Key: TIKA-2140
                 URL: https://issues.apache.org/jira/browse/TIKA-2140
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.13
         Environment: Windows 7 x64, JVM 1.8.0_101
            Reporter: Seva Alekseyev


On the following PDF file, which opens fine in Adobe Reader:

https://dl.dropboxusercontent.com/u/92341073/FDA%20Submission%2096%20Vol.%20III.pdf

the Tika parser throws the following error:

java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSDictionary
        at org.apache.pdfbox.pdmodel.PDPageTree.getKids(PDPageTree.java:144)
        at org.apache.pdfbox.pdmodel.PDPageTree.access$200(PDPageTree.java:38)
        at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:166)
        at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:159)
        at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:153)
        at org.apache.pdfbox.pdmodel.PDPageTree.iterator(PDPageTree.java:123)
        at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:314)
        at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:160)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:144)

Before that, PDFBox throws some warnings:

21 Oct 2016 11:46:35  WARN BaseParser - Invalid dictionary, found: '?' but expected: '/' at offset 22061056
21 Oct 2016 11:46:36  WARN BaseParser - Invalid dictionary, found: '?' but expected: '/' at offset 22061056
21 Oct 2016 11:46:36  WARN COSParser - Object (3:0) at offset 22059324 does not end with 'endobj' but with ''

So the file is somewhat malformed, but not to the point of unreadability.
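
The stack trace suggests that an entry in the page tree resolved to an integer where a dictionary was expected, and that the cast was not guarded. The sketch below illustrates that failure mode and one lenient alternative. It does not use PDFBox: PdfObject, PdfInteger, PdfDictionary, kidsUnchecked and kidsLenient are hypothetical stand-ins invented for this example, and this is not PDFBox's actual code or a proposed patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KidsCastSketch {
    // Hypothetical stand-ins for PDF object types; these are NOT the PDFBox classes.
    interface PdfObject {}

    static final class PdfInteger implements PdfObject {
        final long value;
        PdfInteger(long value) { this.value = value; }
    }

    static final class PdfDictionary implements PdfObject {
        final String name;
        PdfDictionary(String name) { this.name = name; }
    }

    // Unchecked variant: casts every entry, so a stray integer
    // raises ClassCastException, as in the reported trace.
    static List<PdfDictionary> kidsUnchecked(List<PdfObject> kids) {
        List<PdfDictionary> result = new ArrayList<>();
        for (PdfObject kid : kids) {
            result.add((PdfDictionary) kid);
        }
        return result;
    }

    // Lenient variant: keeps dictionaries and skips anything else.
    static List<PdfDictionary> kidsLenient(List<PdfObject> kids) {
        List<PdfDictionary> result = new ArrayList<>();
        for (PdfObject kid : kids) {
            if (kid instanceof PdfDictionary) {
                result.add((PdfDictionary) kid);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulated page tree children with one malformed (integer) entry.
        List<PdfObject> kids = Arrays.asList(
                new PdfDictionary("page1"),
                new PdfInteger(3),
                new PdfDictionary("page2"));

        try {
            kidsUnchecked(kids);
        } catch (ClassCastException e) {
            System.out.println("unchecked: ClassCastException");
        }

        System.out.println("lenient kept " + kidsLenient(kids).size()
                + " of " + kids.size() + " entries");
    }
}
```

Skipping a malformed entry trades completeness for robustness: the remaining pages stay readable, at the cost of silently dropping whatever the bad entry was meant to reference.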



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
