Tim Allison created TIKA-4623:
---------------------------------
Summary: Improve rewind performance on generic InputStreams in 4.x
Key: TIKA-4623
URL: https://issues.apache.org/jira/browse/TIKA-4623
Project: Tika
Issue Type: Task
Reporter: Tim Allison
In TIKA-4619, we made TikaInputStream rewindable. The benefits of rewindability:
* Can go beyond 2 GB.
* Does not interfere with a parser's or detector's need to mark/reset at
non-zero offsets.
TikaInputStream now uses three types of backing InputStream: file, byte array,
and generic. With generic, we buffer to memory and then spool to disk once a
certain threshold is reached.
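As a rough illustration of the "buffer to memory, then spool to disk" idea for
generic streams, here is a minimal Java sketch. The class name SpoolingBuffer,
its methods, and the threshold handling are all hypothetical and are not
TikaInputStream's actual implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration only; not a Tika class. */
final class SpoolingBuffer implements AutoCloseable {
    private final int threshold;                 // max bytes held in memory
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private Path spoolFile;                      // null until we spill to disk
    private OutputStream fileOut;

    SpoolingBuffer(int threshold) {
        this.threshold = threshold;
    }

    void write(byte[] b, int off, int len) throws IOException {
        if (spoolFile == null && memory.size() + len > threshold) {
            // Threshold would be exceeded: move what we have to a temp file.
            spoolFile = Files.createTempFile("spool-", ".tmp");
            fileOut = Files.newOutputStream(spoolFile);
            memory.writeTo(fileOut);
            memory = null;
        }
        if (spoolFile != null) {
            fileOut.write(b, off, len);
        } else {
            memory.write(b, off, len);
        }
    }

    boolean isOnDisk() {
        return spoolFile != null;
    }

    /** Opens a fresh stream positioned at offset 0 of everything written so far. */
    InputStream openFromStart() throws IOException {
        if (spoolFile != null) {
            fileOut.flush();
            return Files.newInputStream(spoolFile);
        }
        return new ByteArrayInputStream(memory.toByteArray());
    }

    @Override
    public void close() throws IOException {
        if (fileOut != null) {
            fileOut.close();
        }
        if (spoolFile != null) {
            Files.deleteIfExists(spoolFile);
        }
    }
}
```

Small inputs never touch the disk in this sketch, but every byte read from the
source is still copied at least once.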
The one downside of this setup is that we buffer the generic InputStream to
memory even when mark/reset might be sufficient.
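To illustrate the case where mark/reset alone is enough, the sketch below peeks
at the first few bytes of a stream (as a detector might) and then resets, using
only the standard java.io mark/reset contract. The helper class MarkResetPeek
is hypothetical and not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper; not a Tika class. */
final class MarkResetPeek {
    private MarkResetPeek() {
    }

    /**
     * Reads up to n bytes from the current position, then resets the stream
     * so the next reader sees the same bytes again.
     */
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n);                    // the mark stays valid for n bytes
        try {
            return in.readNBytes(n);   // Java 11+
        } finally {
            in.reset();                // back to the marked position
        }
    }
}
```

Wrapping a generic stream in a BufferedInputStream is enough to make this work;
the buffer only has to hold the n bytes between mark and reset, not the whole
stream.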
On this ticket, we'll look into adding an "enableRewind()" call to
TikaInputStream. This would be a no-op for file and byte array backed streams
(because those are already rewindable). For generic streams, it would let us
use a basic BufferedInputStream for the many file formats that require only
mark/reset and do not need rewindability. This would put the responsibility on
the digester/detector/parser to know when an InputStream needs to be
rewindable.
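One possible shape for this opt-in behavior is sketched below in Java. This is
only an illustration under stated assumptions: the class RewindableSketch, its
rewind() method, and the rule that enableRewind() must be called before any
bytes are read are all hypothetical, and the capture buffer is memory-only for
brevity (no spooling to disk).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch of an opt-in rewindable wrapper; not Tika's API. */
final class RewindableSketch extends InputStream {

    /** Exposes indexed reads over the bytes captured so far. */
    private static final class Capture extends ByteArrayOutputStream {
        int byteAt(int i) {
            return buf[i] & 0xFF;
        }
    }

    private final InputStream source;   // generic source, mark/reset only
    private Capture capture;            // non-null once rewind is enabled
    private int position;               // bytes consumed by the caller
    private int markPos;

    RewindableSketch(InputStream generic) {
        this.source = new BufferedInputStream(generic);
    }

    /** Assumption: callers opt in before reading anything. */
    void enableRewind() {
        if (capture != null) {
            return;                     // already enabled
        }
        if (position != 0) {
            throw new IllegalStateException("call enableRewind() before reading");
        }
        capture = new Capture();
    }

    void rewind() {
        if (capture == null) {
            throw new IllegalStateException("enableRewind() was not called");
        }
        position = 0;
    }

    @Override
    public int read() throws IOException {
        if (capture != null && position < capture.size()) {
            return capture.byteAt(position++);   // replay captured bytes
        }
        int b = source.read();
        if (b >= 0) {
            if (capture != null) {
                capture.write(b);                // remember for later rewinds
            }
            position++;
        }
        return b;
    }

    @Override
    public boolean markSupported() {
        return true;
    }

    @Override
    public void mark(int readLimit) {
        if (capture == null) {
            source.mark(readLimit);              // cheap path: plain buffering
        }
        markPos = position;
    }

    @Override
    public void reset() throws IOException {
        if (capture == null) {
            source.reset();
        }
        position = markPos;
    }

    @Override
    public void close() throws IOException {
        source.close();
    }
}
```

In this sketch, a parser that only needs mark/reset pays just for the
BufferedInputStream, while a caller that needs to return to offset 0 calls
enableRewind() first and accepts the cost of capturing every byte it reads.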