Joshua Hight created TIKA-2244:
----------------------------------

             Summary: excessive memory usage when parsing a large nested package file
                 Key: TIKA-2244
                 URL: https://issues.apache.org/jira/browse/TIKA-2244
             Project: Tika
          Issue Type: Bug
          Components: core, parser
    Affects Versions: 2.0
            Reporter: Joshua Hight
            Priority: Minor


When parsing large nested files (a couple of good examples are Maven jars and 
git objects), a large number of BufferedInputStreams get created, and their 
buffers take up large amounts of memory. Looking through the relevant code, I 
saw that many of these allocations come from 
TikaInputStream.get(InputStream, TemporaryResources), 
which checks whether the InputStream is a BufferedInputStream or 
ByteArrayInputStream in order to determine whether or not mark is supported. 
Unfortunately, it is common practice to wrap InputStreams in 
CloseShieldInputStreams, which causes that check to fail even if mark is in 
fact supported, so the stream gets wrapped in yet another BufferedInputStream.
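
To illustrate the behaviour described above, here is a minimal sketch using 
only java.io. It is not copied from Tika's source: ShieldedStream is a 
hypothetical stand-in for CloseShieldInputStream, and wrapByType and 
wrapByCapability are made-up names. The first mirrors the type-based check as 
this report describes it; the second shows one possible alternative that asks 
the stream via markSupported() instead.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

public class MarkCheckSketch {

    // Hypothetical stand-in for a close-shielding wrapper: it delegates
    // everything (including markSupported()) but ignores close().
    static final class ShieldedStream extends FilterInputStream {
        ShieldedStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Deliberately leave the wrapped stream open.
        }
    }

    // Type-based check, as described in the report (assumed shape, not
    // Tika's actual code): only two concrete classes count as mark-capable.
    static InputStream wrapByType(InputStream in) {
        if (in instanceof BufferedInputStream
                || in instanceof ByteArrayInputStream) {
            return in;
        }
        return new BufferedInputStream(in); // allocates a new buffer
    }

    // Capability-based check: ask the stream itself whether mark works.
    static InputStream wrapByCapability(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    public static void main(String[] args) {
        InputStream shielded = new ShieldedStream(
                new BufferedInputStream(
                        new ByteArrayInputStream(new byte[] {1, 2, 3})));

        // The wrapper is neither of the two checked types, so the
        // type-based check adds a second, redundant buffer.
        System.out.println(wrapByType(shielded) == shielded);       // false

        // FilterInputStream forwards markSupported() to the wrapped
        // stream, so the capability check reuses the stream as-is.
        System.out.println(wrapByCapability(shielded) == shielded); // true
    }
}
```

With deeply nested packages, each level that passes through the type-based 
path would add one more buffer, which is consistent with the memory growth 
reported here.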



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
