[ https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901042#comment-14901042 ]
Tilman Hausherr edited comment on TIKA-1737 at 9/21/15 8:49 PM:
----------------------------------------------------------------
Some of the exceptions (the ClassCastExceptions in the
org.apache.pdfbox.util.operator package) have an obvious cause that I have
fixed in PDFBOX-2982. For the others I would need the PDF files, and I'm not
sure that they can be fixed in the 1.8 version.
The best approach would be to create an issue in PDFBox for each class of
error, and then track whether the number of unchecked exceptions goes down.
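To make "track whether the number goes down" concrete, here is a minimal sketch of a per-class exception tally that could be run over the same corpus before and after each fix. The class name ExceptionTally and the Callable stand-in for "parse one file" are hypothetical, not part of Tika or PDFBox; the real parse call would need to be plugged in.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;

// Hypothetical helper: counts failures per exception class across a corpus run.
public class ExceptionTally {
    // Failure count keyed by exception class name, e.g. "java.lang.ClassCastException".
    private final Map<String, Integer> counts = new TreeMap<>();

    // Runs one parse attempt. 'parse' is an assumed stand-in for whatever
    // call hands a single file to the parser.
    public void run(Callable<?> parse) {
        try {
            parse.call();
        } catch (Exception e) {
            counts.merge(e.getClass().getName(), 1, Integer::sum);
        }
    }

    public Map<String, Integer> counts() {
        return counts;
    }

    public static void main(String[] args) {
        ExceptionTally tally = new ExceptionTally();
        // Simulated failures; a real run would iterate over the PDF corpus.
        tally.run(() -> { throw new ClassCastException("simulated"); });
        tally.run(() -> { throw new ClassCastException("simulated"); });
        tally.run(() -> { throw new NullPointerException("simulated"); });
        tally.run(() -> "parsed fine");
        tally.counts().forEach((name, n) -> System.out.println(n + "\t" + name));
    }
}
```

Comparing the per-class totals from two runs shows which classes of error a given fix actually removed.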
was (Author: tilman):
Some of the exceptions (the ClassCastExceptions in the
org.apache.pdfbox.util.operator package) have an obvious cause that would be
easy to prevent. For the others I would need the PDF files, and I'm not sure
that they can be fixed in the 1.8 version.
The best approach would be to create an issue in PDFBox for each class of
error, and then track whether the number of unchecked exceptions goes down.
> PDFBox 1.8.10 is still a basket case
> ------------------------------------
>
> Key: TIKA-1737
> URL: https://issues.apache.org/jira/browse/TIKA-1737
> Project: Tika
> Issue Type: Bug
> Components: general
> Affects Versions: 1.10
> Environment: Linux, Solaris
> Reporter: Alan Burlison
> Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather
> than PDFBox being better it's actually far, far worse. With the same corpus,
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each
> time there's an error indexing a PDF file. It's so bad I'm going to switch to
> running pdftotext (part of Xpdf) as an external process. Note that many of
> the errors in PDFBox are clearly caused by programming errors, e.g.
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and
> EOFException.
> I strongly recommend that Tika either revert to PDFBox 1.8.6 or find a
> replacement for PDFBox, as 1.8.10 just isn't fit for purpose.
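For reference, the external-process workaround mentioned in the description could look roughly like the sketch below. It assumes a pdftotext binary on the PATH; the class name PdfToTextRunner, the timeout handling, and the choice to write to a temporary file are illustrative assumptions, not the reporter's actual setup.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper: runs pdftotext in a child process so that a crash or
// memory blow-up in the extractor cannot take down the indexing JVM.
public class PdfToTextRunner {
    // Builds "pdftotext -enc UTF-8 <input.pdf> <output.txt>".
    public static List<String> command(Path pdf, Path txt) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), txt.toString());
    }

    public static String extract(Path pdf, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Writing to a temp file avoids having to drain the child's stdout pipe.
        Path txt = Files.createTempFile("pdftotext", ".txt");
        try {
            Process p = new ProcessBuilder(command(pdf, txt))
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                throw new IOException("pdftotext timed out on " + pdf);
            }
            if (p.exitValue() != 0) {
                throw new IOException("pdftotext exited with " + p.exitValue() + " on " + pdf);
            }
            return new String(Files.readAllBytes(txt), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(txt);
        }
    }
}
```

A failing or hung document then surfaces as an IOException for that one file, with no parser state left behind in the calling process.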