[
https://issues.apache.org/jira/browse/TIKA-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630775#comment-17630775
]
Narendran Solai Sridharan edited comment on TIKA-3919 at 11/9/22 6:28 AM:
--------------------------------------------------------------------------
Unfortunately, local files are not being parsed; the files are being parsed as
streams. The issue occurs only in DRM-protected DOC, XLS and PPT files. If the
same document has its DRM restriction removed, load testing works fine; in that
case even the file size did not matter, and large files could also be parsed
and indexed properly. DRM-protected files alone are causing the OOM issue, even
small ones of around 80 to 90 KB.
I am trying to set "markLimit" programmatically via configuration and will
update my findings.
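
For reference, a minimal sketch of what lowering the detector's buffer limit
through tika-config.xml might look like. This assumes the Tika 2.x
@Field-parameter mechanism exposes "markLimit" on POIFSContainerDetector; the
value 1000000 is only an illustrative figure, not a recommendation:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <detectors>
    <detector class="org.apache.tika.detect.microsoft.POIFSContainerDetector">
      <params>
        <!-- illustrative cap on how many bytes the detector may buffer -->
        <param name="markLimit" type="int">1000000</param>
      </params>
    </detector>
    <!-- keep the default detector chain for all other types -->
    <detector class="org.apache.tika.detect.DefaultDetector"/>
  </detectors>
</properties>
```

A smaller markLimit trades detection accuracy on large OLE2 containers for a
smaller per-thread buffer.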
> Out of Memory during file parsing in AutoDetectParser
> -----------------------------------------------------
>
> Key: TIKA-3919
> URL: https://issues.apache.org/jira/browse/TIKA-3919
> Project: Tika
> Issue Type: Bug
> Components: detector, parser, tika-core
> Affects Versions: 2.4.1
> Environment: OS : Windows 10,
> Software Platform : Java
>
>
> Reporter: Narendran Solai Sridharan
> Priority: Major
> Attachments: Large Object-1.PNG, Model.xlsx, Thread dump.PNG
>
>
> Out of Memory during file parsing in AutoDetectParser. The issue occurs with
> almost all newly created Microsoft documents when parsing documents in
> parallel across different threads; there appears to be an issue in parsing
> newly created documents :(
> java.lang.OutOfMemoryError: Java heap space
> at
> org.apache.tika.io.LookaheadInputStream.<init>(LookaheadInputStream.java:66)
> at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:683)
> at
> org.apache.tika.detect.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:467)
> at
> org.apache.tika.detect.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:530)
> at
> org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:85)
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:142)
> This occurred while load testing our existing environment, which has been
> upgraded from Tika 1.28.1 to 2.4.1.
> The attached, almost empty file [^Model.xlsx] was parsed multiple times via a
> client program driven by JMeter. It seems we are getting the Out of Memory
> error due to the limit "markLimit = 134217728", but we are not sure.
>
> !Thread dump.PNG!
>
> !Large Object-1.PNG!
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)