[ 
https://issues.apache.org/jira/browse/TIKA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15486733#comment-15486733
 ] 

Tim Barrett commented on TIKA-2058:
-----------------------------------

I will add logging to our app so that every file put through the mill is 
logged; that will be a lot of logging :-) We don't currently log each leaf file, 
only the high-level files that act as containers, since there are 7,000,000+ 
leaf files.

Is your theory that a single file is causing this problem? I'm not sure that 
logging will identify the file. It is more likely that certain files result in 
non-GC-able memory being added incrementally, which at some point leads to the 
OOM, as opposed to a single file immediately causing the OOM. Our app is itself 
very memory intensive, with lots of large caches (managed, I hasten to add, 
hence the fact that up to 1.13 all works fine). So, as I say, the offending POI 
code (assuming that is the problem) will be more of an insidious thing that adds 
to an already heavily used heap, rather than certain files leading immediately 
to an OOM error. I can add any code you guys recommend to help us resolve this; 
there's nothing like a good stress test!
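
To keep the per-file logging manageable, one option is to sample heap usage 
every N leaf files and only write a log entry when the heap has grown by more 
than some threshold over that window. The following is a minimal sketch of that 
idea, not anything from Tika or POI: the class name HeapGrowthLogger, the 
record() method, and both constructor parameters are hypothetical, and it uses 
only the standard java.lang.management API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not part of Tika or POI). Samples used heap every
 * sampleEvery files and reports the window of file names only when the
 * heap grew by at least growthThresholdBytes since the previous sample.
 */
final class HeapGrowthLogger {
    private final MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
    private final int sampleEvery;
    private final long growthThresholdBytes;
    private final List<String> recentFiles = new ArrayList<>();
    private long filesSeen = 0;
    private long reports = 0;
    private long lastUsedBytes;

    HeapGrowthLogger(int sampleEvery, long growthThresholdBytes) {
        if (sampleEvery <= 0) {
            throw new IllegalArgumentException("sampleEvery must be positive");
        }
        this.sampleEvery = sampleEvery;
        this.growthThresholdBytes = growthThresholdBytes;
        this.lastUsedBytes = usedHeap();
    }

    private long usedHeap() {
        // Used heap includes garbage not yet collected, so single samples are noisy.
        return memory.getHeapMemoryUsage().getUsed();
    }

    /** Call once after each leaf file is parsed. Returns true if a report was logged. */
    boolean record(String fileName) {
        filesSeen++;
        recentFiles.add(fileName);
        if (filesSeen % sampleEvery != 0) {
            return false; // not a sampling point yet
        }
        long used = usedHeap();
        long growth = used - lastUsedBytes;
        boolean report = growth >= growthThresholdBytes;
        if (report) {
            reports++;
            long windowStart = filesSeen - recentFiles.size() + 1;
            System.err.printf(
                "files %d-%d: heap grew by %d bytes (now %d); files in window: %s%n",
                windowStart, filesSeen, growth, used, recentFiles);
        }
        lastUsedBytes = used;
        recentFiles.clear(); // start a new window
        return report;
    }

    long filesSeen() { return filesSeen; }

    long reports() { return reports; }
}
```

The app would call record(name) after each leaf file is parsed. Because used 
heap rises and falls with garbage collection, a single report means little; a 
steady upward trend across many windows, or the same file types turning up in 
the reported windows, would be the signal to look for.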

I can't let you have the source files themselves as they are confidential 
customer files.

> Memory Leak in Tika version 1.13 when parsing millions of files
> ---------------------------------------------------------------
>
>                 Key: TIKA-2058
>                 URL: https://issues.apache.org/jira/browse/TIKA-2058
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.13
>            Reporter: Tim Barrett
>         Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> We have an application using Tika which parses roughly 7,000,000 files of 
> different types; many of the files are MSG files with attachments. This works 
> correctly with Tika 1.9, and has been in production for over a year, with 
> parsing runs taking place every few weeks. The same application runs into 
> insufficient memory problems (java heap) when using Tika 1.13.
> I have used lsof and file leak detector to track down open files, however 
> neither shows any open files when the application is running. I did find an 
> issue with open files (https://issues.apache.org/jira/browse/TIKA-2015); 
> however, there was a workaround for this, and it is not the issue here.
> I am sorry to have to report this with a level of vagueness, but with lsof 
> turning nothing up I am a bit stuck as to how to investigate further. We are 
> more than willing to help by testing on the basis of any ideas provided.



