[
https://issues.apache.org/jira/browse/TIKA-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208089#comment-14208089
]
Alan Burlison commented on TIKA-1471:
-------------------------------------
In my case I'm using Tika to extract text from a corpus of around 350,000
documents, many of which are attachments to emails that I'm in turn handling
with JavaMail. I therefore don't have an on-disk representation of many of the
documents, so doing all the processing inside the same JVM makes life a little
easier. To keep performance reasonable I'm also using a thread pool, with each
thread holding a Tika instance that is reused for many (tens of thousands of)
documents. During a full re-index memory use creeps inexorably upwards, but as
I destroy the thread pool after each indexing run the memory is reclaimed. I'm
guessing that one or more of the components that Tika uses is a bit tardy in
releasing memory.
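
For anyone trying to reproduce this, the setup described above looks roughly
like the sketch below. It is only an illustration, not my actual indexing
code: TextExtractor and newExtractor() are hypothetical stand-ins for a Tika
instance (with Tika on the classpath the extractor would presumably wrap
org.apache.tika.Tika and call parseToString(InputStream)), and the placeholder
just decodes the bytes as UTF-8 so that the example runs on a plain JDK
(Java 9+ for readAllBytes).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class PerThreadExtractorPool {

    /** Hypothetical stand-in for a Tika instance; not part of Tika's API. */
    interface TextExtractor {
        String extract(InputStream in) throws Exception;
    }

    /** Placeholder extractor: decodes the stream as UTF-8 (assumption for the demo). */
    static TextExtractor newExtractor() {
        return in -> new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }

    /** Runs one indexing pass over in-memory documents, then discards the pool. */
    static List<String> indexRun(List<byte[]> documents, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // One extractor per worker thread, created lazily and reused for every
        // document that thread handles during this run.
        ThreadLocal<TextExtractor> extractor =
                ThreadLocal.withInitial(PerThreadExtractorPool::newExtractor);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (byte[] doc : documents) {
                futures.add(pool.submit(() -> {
                    // Documents never touch the disk: parse straight from memory.
                    try (InputStream in = new ByteArrayInputStream(doc)) {
                        return extractor.get().extract(in);
                    }
                }));
            }
            List<String> texts = new ArrayList<>();
            for (Future<String> f : futures) {
                texts.add(f.get()); // results come back in submission order
            }
            return texts;
        } finally {
            // Destroying the pool ends the worker threads, so their per-thread
            // extractors become unreachable and eligible for collection.
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> docs = Arrays.asList(
                "first attachment".getBytes(StandardCharsets.UTF_8),
                "second attachment".getBytes(StandardCharsets.UTF_8));
        for (String text : indexRun(docs, 2)) {
            System.out.println(text);
        }
    }
}
```

Shutting the pool down in the finally block corresponds to destroying it
after each indexing run. The sketch says nothing about which component
actually holds on to the memory.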
> OOM with corrupt PDF file
> -------------------------
>
> Key: TIKA-1471
> URL: https://issues.apache.org/jira/browse/TIKA-1471
> Project: Tika
> Issue Type: Bug
> Components: general
> Affects Versions: 1.6
> Environment: Linux, JVM 1.8.0_25-b17, 64-bit
> Reporter: Alan Burlison
> Priority: Blocker
> Fix For: 1.7
>
>
> Use of PDFBox 1.8.6 by Tika 1.6 is causing OOM errors with corrupt PDF files,
> due to a bug in PDFBox (see PDFBOX-2493). This makes Tika 1.6 unusable from
> inside a long-running webapp and I've had to revert to Tika 1.5. Although 1.5
> also throws errors with the corrupt file, it does not cause OOM errors.