Alan Burlison created TIKA-1737:
-----------------------------------
Summary: PDFBox 1.8.10 is still a basket case
Key: TIKA-1737
URL: https://issues.apache.org/jira/browse/TIKA-1737
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 1.10
Environment: Linux, Solaris
Reporter: Alan Burlison
In TIKA-1471 I reported OOM errors when parsing PDF files. According to that
bug, the issues were fixed in 1.7. I've just updated to Tika 1.10 and, rather
than being better, PDFBox is actually far, far worse. With the same corpus,
Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, whereas Tika 1.10
(PDFBox 1.8.10) has *453*. Not only that, but as far as I can tell, the memory
leaks are even worse in 1.8.10 as well.
I've had to resort to destroying the Tika instances and starting over each time
there's an error indexing a PDF file. It's so bad I'm going to switch to
running pdftotext (part of Xpdf) as an external process. Note that many of the
errors in PDFBox are clearly caused by programming errors, e.g.
ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and
EOFException.
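For illustration, the external-process workaround could look roughly like the
sketch below. It is not code from this report: the class and method names
(PdfToTextRunner, buildCommand, extractText) and the 60-second timeout are
hypothetical, it assumes pdftotext is on the PATH and accepts "-enc UTF-8", and
it assumes Java 9 or later (List.of, Redirect.DISCARD).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: extracts text by running pdftotext in a separate process,
// so a crash or runaway memory use in the extractor cannot affect the indexer's JVM.
public class PdfToTextRunner {

    // Command line: pdftotext -enc UTF-8 <input.pdf> <output.txt>
    static List<String> buildCommand(Path pdf, Path out) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), out.toString());
    }

    static String extractText(Path pdf, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path out = Files.createTempFile("pdftotext-", ".txt");
        try {
            // Discard the child's stdout/stderr so it can never block on a full pipe.
            Process p = new ProcessBuilder(buildCommand(pdf, out))
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();
            // Kill the child if it hangs on a malformed file.
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                throw new IOException("pdftotext timed out on " + pdf);
            }
            if (p.exitValue() != 0) {
                throw new IOException(
                        "pdftotext exited with " + p.exitValue() + " for " + pdf);
            }
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: PdfToTextRunner <file.pdf>");
            return;
        }
        System.out.print(extractText(Paths.get(args[0]), 60));
    }
}
```

Writing the text to a temporary file rather than reading it from the child's
stdout keeps the timeout meaningful: the parent only waits on the process and
reads the result once it has exited.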
I strongly recommend that Tika either revert to PDFBox 1.8.6 or find a
replacement for PDFBox, as 1.8.10 just isn't fit for purpose.