[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

Alan Burlison (JIRA) Tue, 22 Sep 2015 06:52:35 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902657#comment-14902657
 ]


Alan Burlison commented on TIKA-1737:
-------------------------------------

bq. Could we have done something at the Tika level to cause this...I wonder?

I don't believe so. I think PDFBox is just not cleaning up properly after an 
exception. If you want to 'fix' (?) this at the Tika level I think you'd have 
to do something similar to what I'm doing and create a new PDFBox instance each 
time there's a PDFBox exception.
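
A minimal sketch of that workaround, assuming a hypothetical TextExtractor interface standing in for whatever wraps Tika/PDFBox in the application (none of these names come from Tika or PDFBox). The wrapper drops its current extractor whenever extraction throws, so the next file gets a freshly built one:

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for the component that wraps Tika/PDFBox. */
interface TextExtractor {
    String extract(String path) throws Exception;
}

/** Throws away the current extractor after any failure and rebuilds it lazily. */
final class RecyclingExtractor {
    private final Supplier<TextExtractor> factory;
    private TextExtractor current;   // null until first use or after a failure
    private int discards = 0;        // how many extractors were thrown away

    RecyclingExtractor(Supplier<TextExtractor> factory) {
        this.factory = factory;
    }

    /** Returns the extracted text, or null if extraction failed. */
    String extractOrNull(String path) {
        if (current == null) {
            current = factory.get();  // build a fresh instance on demand
        }
        try {
            return current.extract(path);
        } catch (Exception e) {
            // Assumption: state left behind by the failed parse is only
            // reachable through this instance, so dropping the reference
            // lets the garbage collector reclaim it.
            current = null;
            discards++;
            return null;
        }
    }

    int discards() {
        return discards;
    }
}
```

In a real indexer the factory would construct a new Tika (or parser) instance. This only helps if the leaked memory is held by the discarded instance rather than by static state.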

bq. Does the heap usage jump for every type of exception...that is, if I find 
any old PDF that triggers an exception, do you think I'll see this with Tika 
1.10?

Pretty much. I'm going to try to get a heap dump to work on, but that means 
undoing all the workaround code I've added, so it will take me a while.

bq. Out of curiosity, are you using Tika in the same jvm as Lucene?

Yes, the app is the same as described in TIKA-1471. It's actually a Tomcat 
instance that contains both the Lucene indexer and the search, where Tika is 
being used for text extraction for the Lucene indexer.


> PDFBox 1.8.10 is still a basket case
> ------------------------------------
>
>                 Key: TIKA-1737
>                 URL: https://issues.apache.org/jira/browse/TIKA-1737
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.10
>         Environment: Linux, Solaris
>            Reporter: Alan Burlison
>         Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that 
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather 
> than PDFBox being better it's actually far, far worse. With the same corpus, 
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox 
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I 
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each 
> time there's an error indexing a PDF file. It's so bad I'm going to switch to 
> running pdftotext (part of Xpdf) as an external process. Note that many of 
> the errors in PDFBox are clearly caused by programming errors, e.g. 
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and 
> EOFException.
> I strongly recommend that Tika either revert to PDFBox 1.8.6 or find a 
> replacement for PDFBox, as 1.8.10 just isn't fit for purpose.
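
For reference, the pdftotext fallback mentioned in the description could look roughly like the sketch below. The class name, the temp-file approach and the timeout are assumptions for illustration. It relies only on pdftotext being on the PATH and on its documented "pdftotext [options] PDF-file text-file" form with the -enc option:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that runs pdftotext as a separate process. */
final class PdfToTextRunner {

    /** Builds the command line: pdftotext -enc UTF-8 <pdf> <out>. */
    static List<String> command(Path pdf, Path out) {
        return Arrays.asList("pdftotext", "-enc", "UTF-8",
                pdf.toString(), out.toString());
    }

    /** Extracts text from one PDF; a crash or hang cannot hurt the calling JVM's heap. */
    static String extract(Path pdf, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path out = Files.createTempFile("pdftotext", ".txt");
        try {
            Process proc = new ProcessBuilder(command(pdf, out))
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();
            // Kill the child if it does not finish within the timeout.
            if (!proc.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                proc.destroyForcibly();
                throw new IOException("pdftotext timed out on " + pdf);
            }
            if (proc.exitValue() != 0) {
                throw new IOException("pdftotext exited with "
                        + proc.exitValue() + " for " + pdf);
            }
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);  // always clean up the temp file
        }
    }
}
```

The point of the separate process is isolation: any memory the extractor leaks is released when the child exits, at the cost of a process launch per file.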



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
