[ https://issues.apache.org/jira/browse/TIKA-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-2244.
-------------------------------
Resolution: Fixed
Fix Version/s: 1.15, 2.0
Updated AutoDetectReader. Thank you for opening this.
Out of curiosity, if you can share any metrics on the decrease in memory
consumption from this change, that'd be great! Did you use hprof or another
memory profiling tool to compare usage before and after this change?
> excessive memory usage when parsing a large nested package file
> ---------------------------------------------------------------
>
> Key: TIKA-2244
> URL: https://issues.apache.org/jira/browse/TIKA-2244
> Project: Tika
> Issue Type: Bug
> Components: core, parser
> Affects Versions: 2.0
> Reporter: Joshua Hight
> Priority: Minor
> Fix For: 2.0, 1.15
>
>
> When parsing large nested files (a couple of good examples are Maven jars and git
> objects), a large number of BufferedInputStreams get generated taking up
> large amounts of memory with their buffers. Upon looking through the relevant
> code I saw that many of these allocations were coming from
> TikaInputStream.get(InputStream, TemporaryResources)
> which checks if the InputStream is a BufferedInputStream or
> ByteArrayInputStream in order to determine whether or not mark is supported.
> Unfortunately it is common practice to wrap InputStreams in
> CloseShieldInputStreams, causing this check to fail even if mark is in fact supported.
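
To illustrate the behavior described in the report, here is a minimal sketch using only java.io. It is not Tika's actual code, and it does not claim to show what the fix changed. ShieldStream is a hypothetical stand-in for a close-shielding wrapper such as CloseShieldInputStream, and supportsMarkByType and ensureMark are made-up helper names. The sketch contrasts a type-based check, as the report describes it, with asking the stream itself via InputStream.markSupported().

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

public class MarkCheckSketch {

    // Hypothetical stand-in for a close-shielding wrapper (assumption, not Tika
    // or Commons IO code). FilterInputStream forwards markSupported(), mark()
    // and reset() to the wrapped stream.
    public static class ShieldStream extends FilterInputStream {
        public ShieldStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally does nothing: the underlying stream stays open.
        }
    }

    // Type-based check in the style the report describes: only two concrete
    // classes are recognized as supporting mark.
    public static boolean supportsMarkByType(InputStream in) {
        return in instanceof BufferedInputStream
                || in instanceof ByteArrayInputStream;
    }

    // Capability-based alternative (an assumption about one possible remedy):
    // wrap in a new BufferedInputStream only if the stream reports no mark support.
    public static InputStream ensureMark(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    public static void main(String[] args) {
        InputStream wrapped =
                new ShieldStream(new ByteArrayInputStream(new byte[] {1, 2, 3}));

        System.out.println(supportsMarkByType(wrapped));   // false: wrapper hides the type
        System.out.println(wrapped.markSupported());       // true: delegated to the inner stream
        System.out.println(ensureMark(wrapped) == wrapped); // true: no extra buffer allocated
    }
}
```

With the type-based check, every wrapped stream would receive a fresh BufferedInputStream and its buffer, even though the wrapper already supports mark. In deeply nested packages that happens once per nesting level, which is consistent with the memory growth the report describes.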