Joshua Hight created TIKA-2244:
----------------------------------
Summary: excessive memory usage when parsing a large nested
package file
Key: TIKA-2244
URL: https://issues.apache.org/jira/browse/TIKA-2244
Project: Tika
Issue Type: Bug
Components: core, parser
Affects Versions: 2.0
Reporter: Joshua Hight
Priority: Minor
When parsing large nested files (a couple of good examples are Maven jars and Git
objects), a large number of BufferedInputStreams are created, and their buffers
take up large amounts of memory. Looking through the relevant code, I
saw that many of these allocations come from
TikaInputStream.get(InputStream, TemporaryResources),
which checks whether the InputStream is a BufferedInputStream or
ByteArrayInputStream in order to determine whether or not mark is supported.
Unfortunately, it is common practice to wrap InputStreams in
CloseShieldInputStreams, which causes this type check to fail even if mark is in
fact supported, so a new BufferedInputStream is allocated anyway.
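
To illustrate the difference, here is a minimal sketch that uses only java.io. It
is not the actual Tika code: the class MarkSupportSketch, the CloseShield wrapper
(a stand-in for a close-shielding stream), and both ensureMark* helpers are
hypothetical names made up for this example. It compares a check on the concrete
stream type with a call to InputStream.markSupported().

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

public class MarkSupportSketch {

    // Hypothetical stand-in for a close-shielding wrapper: it forwards
    // everything to the wrapped stream except close().
    static final class CloseShield extends FilterInputStream {
        CloseShield(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally left empty so the wrapped stream stays open.
        }
    }

    // Assumed shape of the type-based check described in this report:
    // only two concrete classes are trusted to support mark/reset.
    static InputStream ensureMarkByType(InputStream in) {
        if (in instanceof BufferedInputStream || in instanceof ByteArrayInputStream) {
            return in;
        }
        return new BufferedInputStream(in); // allocates a new buffer
    }

    // Possible alternative: ask the stream itself. FilterInputStream
    // delegates markSupported() to the stream it wraps.
    static InputStream ensureMarkByCapability(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    public static void main(String[] args) {
        InputStream shielded = new CloseShield(new ByteArrayInputStream(new byte[16]));

        // The type check does not recognise the wrapper, so it re-wraps.
        System.out.println(ensureMarkByType(shielded) == shielded);       // prints "false"

        // The capability check sees that mark is supported and reuses the stream.
        System.out.println(ensureMarkByCapability(shielded) == shielded); // prints "true"
    }
}
```

With the type check, every wrapped stream at every nesting level gets its own
buffer, which would be consistent with the allocations seen here. Whether
markSupported() is a safe replacement for all callers is an open question.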