[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098393#comment-15098393 ]
Tim Allison commented on TIKA-1830:
-----------------------------------
Finished the rerun...and the results look the same.
Question: On PDFBOX-3193, you've set affected versions to 1.8.10 and 1.8.11.
Are you sure that it affects 1.8.10? I wouldn't have discovered it unless I
was actually running 1.8.11.
In 1.8.10, 074531.pdf has ~30k words. When I run 1.8.11 as a unit test within
our PDFParser wrapper, I also get ~30k words. However, when I rerun our batch
wrapper around 1.8.11 on this file, I get the same exception as in the
original run (reported in the reports attached yesterday).
The exception is:
{noformat}
java.lang.NullPointerException
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1077)
at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1275)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1066)
at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:276)
at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:49)
at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:193)
at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:205)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:256)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:216)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:471)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:395)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:354)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat}
I get the same exception when I run this in our batch code with 1 consumer or
10 consumers, so it isn't a multithreading issue. Hmm...will dig some more.
As a side note, I thought I wasn't comparing contents if there was an exception
in one of the files...I need to fix my SQL to make sure this is the case.
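
For illustration only, here is a rough sketch of the kind of filter I have in mind, assuming a hypothetical `extracts` table with one row per file and PDFBox version. The table, the column names, and the sample rows are all made up for this example and do not come from the attached reports; the point is just that the content comparison should join the two runs and skip any file that has an exception on either side.

```python
import sqlite3

# Hypothetical schema: one row per (file, PDFBox version), holding the
# extracted word count and the exception message, if any. These names are
# assumptions for illustration, not the real report tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE extracts (
        file_name  TEXT NOT NULL,
        version    TEXT NOT NULL,
        num_words  INTEGER,
        exception  TEXT          -- NULL when parsing succeeded
    );
""")

# Made-up sample rows: a.pdf parses in both versions, b.pdf fails in 1.8.11.
conn.executemany(
    "INSERT INTO extracts VALUES (?, ?, ?, ?)",
    [
        ("a.pdf", "1.8.10", 1200, None),
        ("a.pdf", "1.8.11", 1180, None),
        ("b.pdf", "1.8.10", 30000, None),
        ("b.pdf", "1.8.11", 0, "java.lang.NullPointerException"),
    ],
)

# Compare contents only for files that parsed cleanly in BOTH versions.
COMPARE_SQL = """
    SELECT a.file_name, a.num_words, b.num_words
    FROM extracts a
    JOIN extracts b ON a.file_name = b.file_name
    WHERE a.version = '1.8.10'
      AND b.version = '1.8.11'
      AND a.exception IS NULL
      AND b.exception IS NULL
    ORDER BY a.file_name
"""
rows = conn.execute(COMPARE_SQL).fetchall()
print(rows)
```

With this filter, b.pdf drops out of the content comparison entirely and would need to be reported separately as a new exception.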
> Upgrade to PDFBox 1.8.11 when available
> ---------------------------------------
>
> Key: TIKA-1830
> URL: https://issues.apache.org/jira/browse/TIKA-1830
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Attachments: reports_pdfbox_1_8_11-rc1.zip
>
>