Tim Allison created TIKA-4623:
---------------------------------

             Summary: Improve rewind performance on generic InputStreams in 4.x
                 Key: TIKA-4623
                 URL: https://issues.apache.org/jira/browse/TIKA-4623
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


In TIKA-4619, we made TikaInputStream rewindable. The benefits of rewindability:
 * Can go beyond 2 GB.
 * Does not interfere with a parser's or detector's need to mark/reset at 
non-zero offsets.

There are now three types of backing InputStream used by TikaInputStream: 
file, byte array, and generic. With generic, we buffer to memory and then 
spool to disk at a certain threshold.
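
As an illustration of the buffer-then-spool idea for generic streams, here is a 
minimal sketch. It is not Tika's implementation: the class name SpoolingCopy, 
the threshold parameter, and the eager copy of the whole stream are all 
assumptions made to keep the example short. It needs Java 9+ for transferTo().

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: copies a generic stream into memory, spilling to a
// temp file once the content would exceed the threshold.
class SpoolingCopy implements Closeable {
    private final byte[] inMemory; // non-null when the content stayed in memory
    private final Path spooled;    // non-null when the content went to disk

    private SpoolingCopy(byte[] inMemory, Path spooled) {
        this.inMemory = inMemory;
        this.spooled = spooled;
    }

    static SpoolingCopy of(InputStream in, int threshold) throws IOException {
        ByteArrayOutputStream mem = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            if (mem.size() + n > threshold) {
                // Threshold crossed: move what we have to disk, then the rest.
                Path tmp = Files.createTempFile("spool-", ".tmp");
                try (OutputStream out = Files.newOutputStream(tmp)) {
                    mem.writeTo(out);
                    out.write(chunk, 0, n);
                    in.transferTo(out);
                } catch (IOException e) {
                    Files.deleteIfExists(tmp);
                    throw e;
                }
                return new SpoolingCopy(null, tmp);
            }
            mem.write(chunk, 0, n);
        }
        return new SpoolingCopy(mem.toByteArray(), null);
    }

    boolean isOnDisk() {
        return spooled != null;
    }

    // Returns a fresh stream positioned at offset zero.
    InputStream rewind() throws IOException {
        return inMemory != null
                ? new ByteArrayInputStream(inMemory)
                : new BufferedInputStream(Files.newInputStream(spooled));
    }

    @Override
    public void close() throws IOException {
        if (spooled != null) {
            Files.deleteIfExists(spooled);
        }
    }
}
```

Either way, the caller pays for a full copy of the content, which is the cost 
discussed next.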

The one downside of this setup is that we buffer generic InputStreams to 
memory even when plain mark/reset might be sufficient.
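
To illustrate the mark/reset case, here is a small sketch using only the JDK. 
The helper name peek() and the sample bytes are made up for the example, and 
readNBytes(int) requires Java 11+. A BufferedInputStream keeps only a bounded 
window of bytes, so nothing is spooled.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class MarkResetPeek {

    // Reads up to 'limit' bytes, then resets so the caller sees the stream
    // from the position it had before the call.
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        try {
            return in.readNBytes(limit);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.7 rest of the document"
                .getBytes(StandardCharsets.US_ASCII);
        // The ByteArrayInputStream stands in for a generic, non-file source.
        try (InputStream in =
                     new BufferedInputStream(new ByteArrayInputStream(data))) {
            String magic = new String(peek(in, 5), StandardCharsets.US_ASCII);
            System.out.println(magic);            // prints "%PDF-"
            System.out.println(in.read() == '%'); // prints "true"
        }
    }
}
```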

On this ticket, we'll look into adding an "enableRewind()" call to 
TikaInputStream. This would be a no-op for file- and byte-array-backed streams 
(because those are already rewindable). For generic streams, it would let us 
use a basic BufferedInputStream for the many file formats that need only 
mark/reset, and turn on full rewindability only when it is requested. This 
would put the responsibility on the digester/detector/parser to know when an 
InputStream needs to be rewindable.
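
One possible shape for such an opt-in, as a rough sketch only. The class 
OptInRewindStream and its rewind() method are hypothetical and not part of 
Tika. This version records bytes in memory only (no spooling to disk) and 
assumes enableRewind() is called before any bytes are read.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper: cheap mark/reset by default, full rewind on request.
class OptInRewindStream extends InputStream {
    private final InputStream source;       // buffered generic stream
    private ByteArrayOutputStream recorded; // null until enableRewind()
    private byte[] replay;                  // bytes replayed after rewind()
    private int replayPos;
    private long consumed;                  // bytes read from the source

    OptInRewindStream(InputStream raw) {
        this.source = new BufferedInputStream(raw);
    }

    // Starts recording so that rewind() can return to offset zero.
    void enableRewind() {
        if (recorded != null) {
            return; // already enabled
        }
        if (consumed > 0) {
            throw new IllegalStateException("enableRewind() must precede reads");
        }
        recorded = new ByteArrayOutputStream();
    }

    void rewind() {
        if (recorded == null) {
            throw new IllegalStateException("rewind not enabled");
        }
        replay = recorded.toByteArray();
        replayPos = 0;
    }

    @Override
    public int read() throws IOException {
        if (replay != null && replayPos < replay.length) {
            return replay[replayPos++] & 0xFF; // serve recorded bytes first
        }
        int b = source.read();
        if (b != -1) {
            consumed++;
            if (recorded != null) {
                recorded.write(b);
            }
        }
        return b;
    }

    // In this sketch, mark/reset is only offered while rewind is disabled.
    @Override
    public boolean markSupported() {
        return recorded == null;
    }

    @Override
    public void mark(int readLimit) {
        if (recorded == null) {
            source.mark(readLimit);
        }
    }

    @Override
    public void reset() throws IOException {
        if (recorded != null) {
            throw new IOException("use rewind() once rewind is enabled");
        }
        source.reset();
    }

    @Override
    public void close() throws IOException {
        source.close();
    }
}
```

A detector that only sniffs a header would use mark()/reset() and never pay 
for the recording; a parser that needs a second full pass would call 
enableRewind() up front.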



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
