Series of exceptions from PDFBox
--------------------------------
Key: TIKA-617
URL: https://issues.apache.org/jira/browse/TIKA-617
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.0
Reporter: Erik Hetzner
Hi,
I am getting the following exceptions from PDFBox when parsing a PDF with tika-app. Thank you!
(If I should file these upstream at PDFBox first, please let me know.)
{preformat}
$ java -jar tika-app-1.0-SNAPSHOT.jar http://www.arb.ca.gov/research/apr/past/01-340.pdf > /dev/null
ERROR - Stop reading corrupt stream
INFO - unsupported/disabled operation: f24.481
INFO - unsupported/disabled operation: ree)n.
WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
    at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
INFO - unsupported/disabled operation: i-
INFO - unsupported/disabled operation: R4%
INFO - unsupported/disabled operation: )
INFO - unsupported/disabled operation: Re.8
INFO - unsupported/disabled operation: e.
INFO - unsupported/disabled operation: FE)-
WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
    at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
INFO - unsupported/disabled operation: R3%
INFO - unsupported/disabled operation: T
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5809fdee
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
Caused by: java.lang.RuntimeException: java.io.IOException: Error: Expected operator 'ID' actual='I8'
    at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:178)
    at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:187)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:266)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    ... 5 more
Caused by: java.io.IOException: Error: Expected operator 'ID' actual='I8'
    at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:382)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
    at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:175)
    ... 15 more
{preformat}
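
For anyone triaging this: both ClassCastException traces come from ShowTextGlyph.process, which suggests that an operand was cast to COSArray without a type check after the corrupt stream yielded an integer instead. Below is a minimal, self-contained sketch of that failure mode and of a defensive guard. It does not use PDFBox. Operand, IntOperand, ArrayOperand, uncheckedCount and guardedCount are hypothetical stand-ins for illustration, not the actual PDFBox code or its eventual fix.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class OperandCastSketch {
    /** Hypothetical stand-in for a parsed content-stream operand. */
    interface Operand {}

    /** Hypothetical stand-in for an integer operand. */
    static final class IntOperand implements Operand {
        final int value;
        IntOperand(int value) { this.value = value; }
    }

    /** Hypothetical stand-in for an array operand. */
    static final class ArrayOperand implements Operand {
        final List<Operand> items;
        ArrayOperand(List<Operand> items) { this.items = items; }
    }

    /** Assumes the first operand is an array; throws ClassCastException otherwise. */
    static int uncheckedCount(List<Operand> arguments) {
        ArrayOperand array = (ArrayOperand) arguments.get(0);
        return array.items.size();
    }

    /** Checks the operand type first and returns -1 for malformed input. */
    static int guardedCount(List<Operand> arguments) {
        if (arguments.isEmpty() || !(arguments.get(0) instanceof ArrayOperand)) {
            return -1; // malformed operator arguments: skip instead of throwing
        }
        return ((ArrayOperand) arguments.get(0)).items.size();
    }

    public static void main(String[] args) {
        List<Operand> wellFormed = Collections.<Operand>singletonList(
                new ArrayOperand(Arrays.<Operand>asList(new IntOperand(1), new IntOperand(2))));
        List<Operand> corrupt = Collections.<Operand>singletonList(new IntOperand(24));

        System.out.println(guardedCount(wellFormed));
        System.out.println(guardedCount(corrupt));
        try {
            uncheckedCount(corrupt);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException from the unchecked cast");
        }
    }
}
```

Whether skipping the malformed operator is the right behaviour for PDFBox is a separate question. The sketch only shows that a type check would turn the runtime exception into a recoverable case.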