[
https://issues.apache.org/jira/browse/TIKA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15462841#comment-15462841
]
Tim Barrett commented on TIKA-2058:
-----------------------------------
The problem occurs in Tika 1.13, not in previous versions.
I am now running the big data parse run again with PDF parsing commented out.
This should help to pinpoint whether or not PDFBox is the source of the OOM
problems.
We will definitely have a (relatively) small number of password-protected PDF
files in the full set of input files. Would it be possible for me to get hold
of the PDFBox fix for TIKA-2045? This may then help to confirm whether the
same issue is at the root of this problem. It sounds to me as though a small
number of these files would be enough to use up available memory, even in a
process with lots of memory (our -Xmx maximum is 8 GB).
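To compare the run with PDF parsing disabled against the full run, heap use could be logged at intervals as files are processed. The following is only a minimal sketch using the standard java.lang.management API. The class name HeapWatch, the report format, and the logging interval are assumptions for illustration, and the loop body is a placeholder for the real parse call, not our actual application code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper (not part of Tika): reports JVM heap use so that
// growth across a long parse run can be compared between configurations.
final class HeapWatch {
    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();
    private static final long MB = 1024L * 1024L;

    /** Heap currently in use, in megabytes. */
    static long usedHeapMb() {
        MemoryUsage heap = MEMORY.getHeapMemoryUsage();
        return heap.getUsed() / MB;
    }

    /** Maximum heap (roughly the -Xmx setting) in megabytes, or -1 if undefined. */
    static long maxHeapMb() {
        long max = MEMORY.getHeapMemoryUsage().getMax();
        return max < 0 ? -1 : max / MB;
    }

    /** One log line tying the number of files parsed so far to heap use. */
    static String report(long filesParsed) {
        return String.format("files=%d usedMb=%d maxMb=%d",
                filesParsed, usedHeapMb(), maxHeapMb());
    }

    public static void main(String[] args) {
        final int totalFiles = 50_000;   // assumed count, for illustration only
        final int logEvery = 10_000;     // assumed logging interval
        for (int i = 1; i <= totalFiles; i++) {
            // Placeholder: the real application would parse file i here.
            if (i % logEvery == 0) {
                System.out.println(report(i));
            }
        }
    }
}
```

If usedMb keeps climbing in the full run but stays flat with PDF parsing disabled, that would point towards the PDF path. Starting the JVM with -XX:+HeapDumpOnOutOfMemoryError would additionally leave a heap dump to inspect when the OOM occurs.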
> Memory Leak in Tika version 1.13 when parsing millions of files
> ---------------------------------------------------------------
>
> Key: TIKA-2058
> URL: https://issues.apache.org/jira/browse/TIKA-2058
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.13
> Reporter: Tim Barrett
>
> We have an application using Tika which parses roughly 7,000,000 files of
> different types; many of the files are MSG files with attachments. This works
> correctly with Tika 1.9, and has been in production for over a year, with
> parsing runs taking place every few weeks. The same application runs into
> insufficient memory problems (java heap) when using Tika 1.13.
> I have used lsof and file leak detector to track down open files; however,
> neither shows any open files when the application is running. I did find an
> issue with open files (https://issues.apache.org/jira/browse/TIKA-2015), but
> there was a workaround for this, so it is not the issue here.
> I am sorry to have to report this with a level of vagueness, but with lsof
> turning nothing up I am a bit stuck as to how to investigate further. We are
> more than willing to help by testing on the basis of any ideas provided.