[ https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902580#comment-14902580 ]
Alan Burlison commented on TIKA-1737:
-------------------------------------
The heap dump is huge and the profiler struggles to cope, so I haven't managed
to do any detailed analysis yet. A pool of Tika parser threads handles the
corpus; each thread is reused to extract text from multiple documents, and
that text is then fed into Lucene. With Tika 1.10, every time a Tika instance
sees an exception from PDFBox the heap usage jumps up and doesn't recover,
leading to OOM when the indexing run is only a short way through. That doesn't
happen with Tika 1.5. I've modified the indexer so that rather than just
logging the Tika exceptions it destroys the relevant Tika instance, does a
forced GC and then creates a new Tika instance. With Tika 1.10 that keeps the
heap size within reasonable bounds. To me that seems like pretty conclusive
proof that PDFBox is leaking when it throws exceptions.
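
For anyone who wants to try the same workaround, here is a minimal sketch of
the recycle-on-exception pattern described above. It is not the actual indexer
code: the class name RecyclingExtractor, the nested Extraction callback and the
recycleCount counter are all hypothetical, and the extractor type is left
generic so the sketch compiles without Tika on the classpath. It assumes one
wrapper per worker thread, so it is deliberately not thread-safe.

```java
import java.util.function.Supplier;

/**
 * Wraps an extractor that may leak after a failure and replaces it whenever
 * an extraction throws. E stands in for the real extractor type; all names
 * in this class are illustrative, not part of Tika or PDFBox.
 */
final class RecyclingExtractor<E> {

    /** Hypothetical callback: runs one extraction and may throw. */
    interface Extraction<E> {
        String apply(E extractor, String path) throws Exception;
    }

    private final Supplier<E> factory;
    private final Extraction<E> extraction;
    private E current;
    private int recycleCount = 0;

    RecyclingExtractor(Supplier<E> factory, Extraction<E> extraction) {
        this.factory = factory;
        this.extraction = extraction;
        this.current = factory.get();
    }

    /** Returns the extracted text, or null if the document failed to parse. */
    String extract(String path) {
        try {
            return extraction.apply(current, path);
        } catch (Exception e) {
            System.err.println("extraction failed for " + path + ": " + e);
            current = null;          // drop our only reference to the suspect instance
            System.gc();             // only a request; the JVM is free to ignore it
            current = factory.get(); // start over with a fresh instance
            recycleCount++;
            return null;
        }
    }

    /** Number of times the wrapped instance has been replaced. */
    int recycleCount() {
        return recycleCount;
    }
}
```

Assuming the "Tika instance" is the org.apache.tika.Tika facade, the factory
could be Tika::new and the extraction something like
(tika, path) -> tika.parseToString(new File(path)). Dropping the reference only
helps if nothing else (a static cache, a thread-local) still points at the old
instance.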
> PDFBox 1.8.10 is still a basket case
> ------------------------------------
>
> Key: TIKA-1737
> URL: https://issues.apache.org/jira/browse/TIKA-1737
> Project: Tika
> Issue Type: Bug
> Components: general
> Affects Versions: 1.10
> Environment: Linux, Solaris
> Reporter: Alan Burlison
> Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather
> than PDFBox being better it's actually far, far worse. With the same corpus,
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each
> time there's an error indexing a PDF file. It's so bad I'm going to switch to
> running pdftotext (part of Xpdf) as an external process. Note that many of
> the errors in PDFBox are clearly caused by programming errors, e.g.
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and
> EOFException.
> I strongly recommend that Tika either reverts to PDFBox 1.8.6 or finds a
> replacement for PDFBox, as 1.8.10 just isn't fit for purpose.