[ 
https://issues.apache.org/jira/browse/TIKA-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629999#comment-17629999
 ] 

Tim Allison commented on TIKA-3919:
-----------------------------------

As I look at the code, I think we should use BoundedInputStream instead of 
LookaheadInputStream, while still allowing users to set a {{markLimit}}.
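The idea of bounding the look-ahead can be sketched in plain java.io (this is a hypothetical illustration, not Tika's actual BoundedInputStream, which differs in detail):

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: caps how many bytes callers may read from the
// underlying stream, so detection can never buffer more than markLimit.
class BoundedSketch extends FilterInputStream {
    private long remaining;

    BoundedSketch(InputStream in, long markLimit) {
        super(in);
        this.remaining = markLimit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // report EOF once the bound is hit
        }
        int b = super.read();
        if (b != -1) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) {
            return -1;
        }
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n != -1) {
            remaining -= n;
        }
        return n;
    }
}
```

The key difference from a look-ahead buffer is that nothing is copied into memory up front; the bound is enforced lazily as bytes are consumed.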

Again, if you are parsing a local file, I strongly encourage opening the 
InputStream with TikaInputStream.get(path).  
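A minimal sketch of that recommendation (the file name here is just an example):

```java
import java.nio.file.Paths;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ParseLocalFile {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        // Because the stream is backed by a file, detectors can open the
        // container once and hand it on instead of spooling the stream.
        try (TikaInputStream tis = TikaInputStream.get(Paths.get("Model.xlsx"))) {
            parser.parse(tis, new BodyContentHandler(-1), metadata);
        }
        System.out.println(metadata.get(Metadata.CONTENT_TYPE));
    }
}
```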

In our current code, if a user submits a TikaInputStream initialized with a raw 
InputStream (i.e. not backed by a file), our POIFSContainerDetector tries to 
copy the stream to disk (up to markLimit), then MiscOLEDetector tries to copy 
it to disk again (up to markLimit), and only then is the file parsed.

If a user submits a TikaInputStream from a file/path, the 
POIFSContainerDetector loads the POIFSFileSystem object into the 
TikaInputStream, and then it is reused by the MiscOLEDetector and the parser.

> Out of Memory during file parsing in AutoDetectParser
> -----------------------------------------------------
>
>                 Key: TIKA-3919
>                 URL: https://issues.apache.org/jira/browse/TIKA-3919
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser, tika-core
>    Affects Versions: 2.4.1
>         Environment: OS : Windows 10,
> Software Platform : Java
>  
>  
>            Reporter: Narendran Solai Sridharan
>            Priority: Major
>         Attachments: Large Object-1.PNG, Model.xlsx, Thread dump.PNG
>
>
> Out of Memory during file parsing in AutoDetectParser. The issue occurs with 
> almost all newly created Microsoft documents when parsing documents in 
> parallel across different threads; there seems to be an issue in parsing new 
> documents :(
> java.lang.OutOfMemoryError: Java heap space
>     at 
> org.apache.tika.io.LookaheadInputStream.<init>(LookaheadInputStream.java:66)
>     at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:683)
>     at 
> org.apache.tika.detect.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:467)
>     at 
> org.apache.tika.detect.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:530)
>     at 
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:85)
>     at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:142)
> We hit this while load testing our existing environment, which has been 
> upgraded from Tika 1.28.1 to 2.4.1. The attached file, which is almost empty 
> ([^Model.xlsx]), was parsed multiple times by a client program driven by 
> JMeter. It seems we are getting the Out of Memory error because of the limit 
> "markLimit = 134217728", but we are not sure.
>  
> !Thread dump.PNG!
>  
> !Large Object-1.PNG!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
