[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180402#comment-14180402 ]
Tim Allison commented on TIKA-1442:
-----------------------------------
Sorry, I ran the new eval code on the old 1.8.8 batch process. I will rerun
the batch process with the latest 1.8.8.
For file 27372.pdf, I see this in Excel:
{noformat}
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:249)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:137)
    at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:120)
    at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:153)
    at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:96)
    at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:38)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSStream
    at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getMetadata(PDDocumentCatalog.java:312)
    at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:181)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
    ... 13 more
{noformat}
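For reference, here is a minimal sketch of the failure mode that the "Caused by" line points to: an object that is only a dictionary gets cast down to a stream type. The classes below (CosBase, CosDictionary, CosStream) and both methods are hypothetical stand-ins written for illustration. They are not the PDFBox classes and not the code in PDDocumentCatalog. The sketch only assumes that the stream type is a subclass of the dictionary type, and it shows one possible defensive check, not an actual or proposed fix.

```java
import java.util.Optional;

class MetadataCastSketch {
    // Hypothetical stand-ins for the COS object types; NOT the real PDFBox classes.
    static class CosBase {}
    static class CosDictionary extends CosBase {}
    static class CosStream extends CosDictionary {}

    // Unchecked downcast: throws ClassCastException when the entry is a plain
    // dictionary, which is the kind of failure reported in the trace.
    static CosStream metadataUnchecked(CosBase entry) {
        return (CosStream) entry;
    }

    // Defensive variant (an assumption, not a known fix): test the runtime type
    // first and report "no usable metadata" instead of throwing.
    static Optional<CosStream> metadataChecked(CosBase entry) {
        if (entry instanceof CosStream) {
            return Optional.of((CosStream) entry);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Simulates a file whose metadata entry is a dictionary rather than a stream.
        CosBase malformed = new CosDictionary();
        try {
            metadataUnchecked(malformed);
        } catch (ClassCastException e) {
            System.out.println("unchecked cast failed: " + e.getMessage());
        }
        System.out.println("checked variant present: "
                + metadataChecked(malformed).isPresent());
    }
}
```

With the checked variant, a caller can skip metadata extraction for such a file and still go on to extract the text.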
Should I try to grab more than that? Or, are you seeing the same thing that
I'm seeing in the Excel file?
> Upgrade to PDFBox 1.8.8
> -----------------------
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Assignee: Tim Allison
> Fix For: 1.7
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx,
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to
> 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika
> 1.7. Let's use this issue to carry on the discussion of regression testing
> (if any further discussion is necessary) or any other prep that needs to
> happen before 1.8.8's release.