[ 
https://issues.apache.org/jira/browse/TIKA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15484222#comment-15484222
 ] 

Tim Barrett commented on TIKA-2058:
-----------------------------------

No, it's lower level than that. Whether primary or embedded, they all end up in 
the same place; this is where I commented it out.

        if (processMsgAttachments) {
            resourceFileInputStream = NalandaFileSystemUtilities.getInputStream(
                    resourceFilePath, parentResourceSet.getNalyticsProperties());
            // resourceFileInputStream = Files.newInputStream(resourceFilePath);
            try {
                MAPIMessage mapiMessage = new MAPIMessage(resourceFileInputStream);
                AttachmentChunks attachments[] = mapiMessage.getAttachmentFiles();
                if (attachments.length > 0) {
                    for (AttachmentChunks attachment : attachments) {
                        if (!attachment.isEmbeddedMessage()) {
                            processFileEmbeddedInMsg(msgGranule, resourceFilePath,
                                    parentResourceSet, attachment);
                        } else {
                            processMsgEmbeddedInMsg(msgGranule, resourceFilePath,
                                    parentResourceSet, attachment);
                        }
                    }
                }
            } finally {
                if (resourceFileInputStream != null) {
                    resourceFileInputStream.close();
                }
            }
        }


Deeper down, all files end up in the same place to be parsed, and this is 
where the PDF exclusion comes in. So the attachments do get opened and written 
to disk, including PDFs, but only via the MAPIMessage obtained from 
mapiMessage = new MAPIMessage(resourceFileInputStream). That gives us the 
attachments, which are then written to disk without looking at what they are.

There was a problem with this code that I also reported, which led to file 
handles being left open. It happened when I used MAPIMessage mapiMessage = new 
MAPIMessage(String filename). Using an input stream from the file solved that, 
and I have not seen any open files since switching to the input stream. Could 
these be related, however? Could other changes made in the POI libraries be 
related to the OOM problem? The MAPIMessage constructor with the string file 
name was never a problem before Tika 1.13.
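
To make the file-handle point concrete, here is a minimal sketch of the 
close-on-every-path pattern using only JDK classes. ParsedMessage and 
TrackingStream are hypothetical stand-ins that I made up for illustration; 
they are not POI classes and do not reproduce MAPIMessage's behaviour. The 
sketch only shows that try-with-resources closes the stream whether 
processing succeeds or throws. Whether MAPIMessage itself also needs an 
explicit close() depends on the POI version in use, which is an assumption I 
have not verified here.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for a parsed message that owns its source stream.
// This is NOT the POI MAPIMessage class.
final class ParsedMessage implements Closeable {
    private final InputStream source;

    ParsedMessage(InputStream source) {
        this.source = source;
    }

    // Placeholder logic: pretend each remaining byte is one attachment.
    int attachmentCount() throws IOException {
        return source.available();
    }

    @Override
    public void close() throws IOException {
        source.close();
    }
}

// In-memory stream that records whether close() was called.
final class TrackingStream extends ByteArrayInputStream {
    boolean closed = false;

    TrackingStream(byte[] data) {
        super(data);
    }

    @Override
    public void close() throws IOException {
        closed = true;
        super.close();
    }
}

final class HandleSketch {
    // Returns true if the stream was closed after processing, including
    // the case where processing fails part-way through.
    static boolean closedAfterProcessing(boolean failMidway) {
        TrackingStream stream = new TrackingStream(new byte[] {1, 2, 3});
        // Resources are closed in reverse order: msg first, then in.
        try (TrackingStream in = stream;
             ParsedMessage msg = new ParsedMessage(in)) {
            if (failMidway) {
                throw new IOException("simulated parse failure");
            }
            msg.attachmentCount();
        } catch (IOException e) {
            // Ignored in this sketch; real code would log or rethrow.
        }
        return stream.closed;
    }
}
```

Calling HandleSketch.closedAfterProcessing(true) exercises the failure path; 
the stream is still reported as closed. This addresses open handles only and 
says nothing about where the heap is going.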






> Memory Leak in Tika version 1.13 when parsing millions of files
> ---------------------------------------------------------------
>
>                 Key: TIKA-2058
>                 URL: https://issues.apache.org/jira/browse/TIKA-2058
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.13
>            Reporter: Tim Barrett
>
> We have an application using Tika which parses roughly 7,000,000 files of 
> different types; many of the files are MSG files with attachments. This works 
> correctly with Tika 1.9 and has been in production for over a year, with 
> parsing runs taking place every few weeks. The same application runs into 
> insufficient memory problems (Java heap) when using Tika 1.13.
> I have used lsof and file leak detector to track down open files; however, 
> neither shows any open files when the application is running. I did find an 
> issue with open files (https://issues.apache.org/jira/browse/TIKA-2015); 
> however, there was a workaround for this and this is not the issue.
> I am sorry to have to report this with a level of vagueness, but with lsof 
> turning nothing up I am a bit stuck as to how to investigate further. We are 
> more than willing to help by testing on the basis of any ideas provided.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
