[ https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902835#comment-14902835 ]

Tim Allison commented on TIKA-1737:
-----------------------------------

See PDFBOX-2986 for a resource leak discovered while testing against a file 
in Common Crawl; that file triggered a ttfparser exception similar to some of 
yours.  I don't think that leak affected you, because your ttf exceptions are 
triggered within a PDFFile, where the MemoryTTFDataStream would have been used.

bq. It's actually a Tomcat instance that contains both Lucene indexer and 
search, where Tika is being used for text extraction for the Lucene indexer.

Ah, ok, that's right.  Apologies for repeating my soapbox from TIKA-1471. I 
realize a single JVM is the easiest way to build an app, but Tika can run into 
serious problems, and I'd strongly encourage keeping Tika out of the same JVM 
as Lucene if at all possible.  This is not to say we shouldn't fix Tika and its 
dependencies when problems are found!
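
To make "out of the same JVM" concrete, here is a minimal sketch of running 
extraction in a child process with a hard timeout, so that a leak or a hang 
cannot take down the indexer. It uses only the JDK (Java 11+ assumed). The 
class name ExternalExtractor and its methods are hypothetical helpers, not 
part of Tika or Lucene, and the pdftotext command line is just one possible 
extractor, assumed to be on the PATH. Tika's own ForkParser and tika-server 
are other routes to the same kind of isolation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of Tika or Lucene.
final class ExternalExtractor {

    // Command line for pdftotext; the trailing "-" sends the text to stdout.
    static List<String> buildCommand(Path pdf) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Runs the command in a child process and returns its stdout as UTF-8 text.
    // Throws IOException if the process times out or exits with a non-zero code.
    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Write stdout to a temp file so a full pipe buffer can never block the child.
        Path out = Files.createTempFile("extract", ".txt");
        try {
            Process p = new ProcessBuilder(command)
                    .redirectOutput(out.toFile())
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly(); // kill a hung or runaway extractor
                throw new IOException("extraction timed out: " + command);
            }
            if (p.exitValue() != 0) {
                throw new IOException("extraction failed, exit code " + p.exitValue());
            }
            return Files.readString(out);
        } finally {
            Files.deleteIfExists(out);
        }
    }
}
```

A caller in the indexer would use it roughly as 
ExternalExtractor.run(ExternalExtractor.buildCommand(Path.of("doc.pdf")), 60) 
and treat an IOException as "skip this document". Memory used by the extractor 
is returned to the OS when the child exits, whatever state the parser was in.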

> PDFBox 1.8.10 is still a basket case
> ------------------------------------
>
>                 Key: TIKA-1737
>                 URL: https://issues.apache.org/jira/browse/TIKA-1737
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.10
>         Environment: Linux, Solaris
>            Reporter: Alan Burlison
>         Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that 
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather 
> than PDFBox being better it's actually far, far worse. With the same corpus, 
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox 
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I 
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each 
> time there's an error indexing a PDF file. It's so bad I'm going to switch to 
> running pdftotext (part of Xpdf) as an external process. Note that many of 
> the errors in PDFBox are clearly caused by programming errors, e.g. 
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and 
> EOFException.
> I strongly recommend that Tika either reverts back to PDFBox 1.8.6 or finds a 
> replacement for PDFBox as 1.8.10 just isn't fit for purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
