[
https://issues.apache.org/jira/browse/TIKA-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629993#comment-17629993
]
Tim Allison commented on TIKA-3919:
-----------------------------------
You can set the markLimit with this:
{noformat}
<properties>
  <detectors>
    <detector class="org.gagravarr.tika.OggDetector"/>
    <detector class="org.apache.tika.detect.apple.BPListDetector"/>
    <detector class="org.apache.tika.detect.microsoft.POIFSContainerDetector">
      <params>
        <param name="markLimit" type="int">10000</param>
      </params>
    </detector>
    <detector class="org.apache.tika.detect.ole.MiscOLEDetector"/>
    <detector class="org.apache.tika.detect.zip.DefaultZipContainerDetector"/>
    <detector class="org.apache.tika.mime.MimeTypes"/>
  </detectors>
</properties>
{noformat}
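As a rough illustration of why lowering markLimit can matter under parallel load, the sketch below multiplies a markLimit value by a thread count. It assumes that each concurrent detection allocates one buffer of markLimit bytes, which the LookaheadInputStream frame in the stack trace suggests but which is not confirmed here. The class name, the helper methods, and the thread count of 8 are hypothetical; only the two markLimit values come from this issue.

```java
import java.util.Locale;

// Back-of-the-envelope estimate only; this does not call into Tika.
class MarkLimitEstimate {

    /** Bytes held if every thread allocates one buffer of markLimit bytes (assumption). */
    public static long worstCaseBytes(int markLimit, int threads) {
        // Widen to long before multiplying so large limits do not overflow int.
        return (long) markLimit * threads;
    }

    /** Converts a byte count to mebibytes. */
    public static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        int reportedLimit = 134_217_728; // value quoted in the issue description
        int loweredLimit = 10_000;       // value used in the config above
        int threads = 8;                 // hypothetical number of parallel parsing threads

        for (int limit : new int[] {reportedLimit, loweredLimit}) {
            long bytes = worstCaseBytes(limit, threads);
            System.out.println(String.format(Locale.ROOT,
                    "markLimit=%d x %d threads = %d bytes (%.2f MiB)",
                    limit, threads, bytes, toMiB(bytes)));
        }
    }
}
```

Under that assumption, a markLimit of 134217728 bytes is 128 MiB per buffer, so eight parallel detections could ask for about 1 GiB of heap for buffers alone, while a markLimit of 10000 keeps the same figure at 80000 bytes.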
> Out of Memory during file parsing in AutoDetectParser
> -----------------------------------------------------
>
> Key: TIKA-3919
> URL: https://issues.apache.org/jira/browse/TIKA-3919
> Project: Tika
> Issue Type: Bug
> Components: detector, parser, tika-core
> Affects Versions: 2.4.1
> Environment: OS : Windows 10,
> Software Platform : Java
>
>
> Reporter: Narendran Solai Sridharan
> Priority: Major
> Attachments: Large Object-1.PNG, Model.xlsx, Thread dump.PNG
>
>
> An Out of Memory error occurs during file parsing in AutoDetectParser. It
> happens with almost all newly created Microsoft documents when they are
> parsed in parallel on different threads, so there seems to be an issue in
> parsing new documents :(
> java.lang.OutOfMemoryError: Java heap space
>     at org.apache.tika.io.LookaheadInputStream.<init>(LookaheadInputStream.java:66)
>     at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:683)
>     at org.apache.tika.detect.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:467)
>     at org.apache.tika.detect.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:530)
>     at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:85)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:142)
> This happened while load testing our existing environment, which has been
> upgraded from Tika 1.28.1 to 2.4.1.
> The following file, which is almost empty [^Model.xlsx], was parsed by a
> client program multiple times via JMeter. It seems we are getting Out of
> Memory because of the limit "markLimit = 134217728", but we are not sure.
>
> !Thread dump.PNG!
>
> !Large Object-1.PNG!
>
>