Alan Burlison created TIKA-1737:
-----------------------------------

             Summary: PDFBox 1.8.10 is still a basket case
                 Key: TIKA-1737
                 URL: https://issues.apache.org/jira/browse/TIKA-1737
             Project: Tika
          Issue Type: Bug
          Components: general
    Affects Versions: 1.10
         Environment: Linux, Solaris
            Reporter: Alan Burlison


In TIKA-1471 I reported OOM errors when parsing PDF files. According to that 
bug, the issues were fixed in 1.7. I've just updated to Tika 1.10 and, rather 
than getting better, PDFBox is actually far, far worse. With the same corpus, 
Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, whereas Tika 1.10 
(PDFBox 1.8.10) has *453*. Not only that but, as far as I can tell, the memory 
leaks are even worse in 1.8.10.

I've had to resort to destroying the Tika instances and starting over each time 
there's an error indexing a PDF file. It's so bad I'm going to switch to 
running pdftotext (part of Xpdf) as an external process. Note that many of the 
errors in PDFBox are clearly caused by programming errors, e.g. 
ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and 
EOFException.
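
For illustration, the external-process approach could look roughly like the 
sketch below. It is not code from Tika or PDFBox. The class and method names 
(ExternalPdfText, run, pdfToText) are hypothetical, and it assumes Java 11+, 
a pdftotext binary on the PATH, and an arbitrary 60 second timeout. The point 
is that a crash, hang or leak in the extractor stays in the child process.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; none of these names come from Tika or PDFBox.
class ExternalPdfText {

    // Runs a command, captures its stdout in a temp file, enforces a timeout.
    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path out = Files.createTempFile("extract", ".txt");
        try {
            Process p = new ProcessBuilder(command)
                    .redirectOutput(out.toFile())
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();
            // Writing to a file means a stuck child cannot block us on a pipe.
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                throw new IOException("timed out: " + command);
            }
            if (p.exitValue() != 0) {
                throw new IOException("exit code " + p.exitValue() + ": " + command);
            }
            return Files.readString(out); // decodes as UTF-8
        } finally {
            Files.deleteIfExists(out);
        }
    }

    // Assumption: pdftotext is installed; "-" sends the text to stdout.
    static String pdfToText(Path pdf) throws IOException, InterruptedException {
        return run(List.of("pdftotext", "-q", "-enc", "UTF-8", pdf.toString(), "-"), 60);
    }
}
```

A caller would use ExternalPdfText.pdfToText(Path.of("some.pdf")) and treat an 
IOException as "skip this file", with no need to tear down any in-process 
parser state.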

I strongly recommend that Tika either revert to PDFBox 1.8.6 or find a 
replacement for PDFBox, as 1.8.10 just isn't fit for purpose.


