This is an automated email from the ASF dual-hosted git repository.
tallison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git
The following commit(s) were added to refs/heads/main by this push:
new 5f9a808ac3 TIKA-4623 -- for general updates, don't buffer unless
enableRewind has been set (#2534)
5f9a808ac3 is described below
commit 5f9a808ac316ca09699484d207e66614ed7fef5e
Author: Tim Allison <[email protected]>
AuthorDate: Thu Jan 15 12:45:32 2026 -0500
TIKA-4623 -- for general updates, don't buffer unless enableRewind has been
set (#2534)
---
docs/spooling.adoc | 72 ++++-----
.../java/org/apache/tika/io/ByteArraySource.java | 26 ++++
.../org/apache/tika/io/CachingInputStream.java | 19 ++-
.../java/org/apache/tika/io/CachingSource.java | 170 ++++++++++++++++++---
.../main/java/org/apache/tika/io/FileSource.java | 26 ++++
.../java/org/apache/tika/io/TikaInputSource.java | 13 ++
.../java/org/apache/tika/io/TikaInputStream.java | 47 +++++-
.../parser/multiple/AbstractMultipleParser.java | 3 +
.../org/apache/tika/io/TikaInputStreamTest.java | 160 ++++++++++++++++++-
.../org/apache/tika/parser/crypto/TSDParser.java | 2 +
.../org/apache/tika/parser/pkg/PackageParser.java | 4 +-
.../org/apache/tika/zip/utils/ZipSalvager.java | 2 +
12 files changed, 476 insertions(+), 68 deletions(-)
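
The core behavioral change in this patch is that a stream-backed source starts in passthrough mode and keeps bytes for replay only after enableRewind() is called at position 0. The following is a minimal, self-contained sketch of that idea. It is not Tika's CachingSource: the class name RewindSketch and the helper readAscii are hypothetical, and the in-memory cache stands in for the spill-to-disk StreamCache used in the real code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not Tika's CachingSource.
class RewindSketch extends InputStream {
    private final InputStream in;
    private ByteArrayOutputStream cache; // null while in passthrough mode
    private int position;                // bytes consumed by the caller so far

    RewindSketch(InputStream in) {
        this.in = in;
    }

    // Switch to caching mode; only legal before any byte has been read.
    void enableRewind() {
        if (cache != null) {
            return; // already caching: repeated calls are a no-op
        }
        if (position != 0) {
            throw new IllegalStateException("position is " + position + ", must be 0");
        }
        cache = new ByteArrayOutputStream();
    }

    @Override
    public int read() throws IOException {
        if (cache != null && position < cache.size()) {
            // Replaying after a rewind: serve the byte from the cache.
            // toByteArray() copies; acceptable for a sketch, not for production.
            return cache.toByteArray()[position++] & 0xFF;
        }
        int b = in.read();
        if (b != -1) {
            if (cache != null) {
                cache.write(b); // remember the byte so it can be replayed
            }
            position++;
        }
        return b;
    }

    // Go back to offset 0; without a cache this only works if nothing was read.
    void rewind() throws IOException {
        if (cache == null && position != 0) {
            throw new IOException("Cannot rewind in passthrough mode; call enableRewind() first");
        }
        position = 0;
    }

    // Helper: read up to n bytes and decode them as ASCII.
    static String readAscii(InputStream s, int n) throws IOException {
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int r = s.read(buf, total, n - total);
            if (r < 0) {
                break;
            }
            total += r;
        }
        return new String(buf, 0, total, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "Hello, World!".getBytes(StandardCharsets.US_ASCII);
        RewindSketch cached = new RewindSketch(new ByteArrayInputStream(data));
        cached.enableRewind();                     // opt in before reading
        System.out.println(readAscii(cached, 5));  // prints "Hello"
        cached.rewind();
        System.out.println(readAscii(cached, 13)); // prints "Hello, World!"
    }
}
```

In the sketch, as in the patch, a caller that only streams through the data never pays for the cache. A caller that needs to re-read must opt in before the first read.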
diff --git a/docs/spooling.adoc b/docs/spooling.adoc
index 3385fd23a4..81d3bb18e4 100644
--- a/docs/spooling.adoc
+++ b/docs/spooling.adoc
@@ -26,44 +26,24 @@
=== What is Spooling?
Spooling refers to the process of writing an input stream to a temporary file on disk.
-This is necessary for certain file formats that require random access to the underlying
-bytes during detection or parsing.
+This benefits certain file formats that can be processed more efficiently with random access
+to the underlying bytes during detection or parsing.
=== Why Some Formats Benefit from Random Access
Several file formats are most efficiently processed with random access vs streaming:
-* **OLE2 (Microsoft Office legacy formats)**: The POI library needs to read the file
+* **OLE2 (Microsoft Office legacy formats)**: The POI library benefits from reading the file
as a random-access structure to navigate the OLE2 container.
-* **ZIP-based formats**: Container detection requires reading the ZIP central directory,
- which is located at the end of the file.
-* **Binary Property Lists (bplist)**: Apple's binary plist format requires random access
+* **ZIP-based formats**: Container detection benefits from reading the ZIP central directory,
+ which is located at the end of the file. Parsing also benefits from random access.
+* **Binary Property Lists (bplist)**: Apple's binary plist format benefits from random access
for efficient parsing.
-* **PDF**: While detection works via magic bytes, parsing requires random access for
+* **PDF**: While detection works via magic bytes, parsing benefits from random access for
the PDF cross-reference table.
=== Architectural Decision: Decentralized Spooling
-==== The Problem with Centralized Spooling
-
-Earlier versions of Tika considered centralizing spooling decisions in `DefaultDetector`.
-The detector would check the detected media type and spool to disk before passing the
-stream to specialized detectors or parsers.
-
-This approach had several drawbacks:
-
-1. **Unnecessary spooling**: PDF files need spooling for _parsing_ but not for _detection_
- (magic bytes suffice). Centralized detection-time spooling would spool PDFs unnecessarily
- when only detecting.
-
-2. **Redundant logic**: Specialized detectors like `POIFSContainerDetector` and
- `DefaultZipContainerDetector` already call `TikaInputStream.getFile()` or `getPath()`
- when they need random access. They know best when spooling is required.
-
-3. **Coupling**: Centralized spooling couples the detector to knowledge about which
- formats need random access, duplicating logic that already exists in specialized
- components.
-
==== The Solution: Let Components Self-Spool
The current architecture follows a simple principle: **each component that needs random
@@ -78,11 +58,13 @@ Path path = TikaInputStream.get(inputStream).getPath();
File file = TikaInputStream.get(inputStream).getFile();
----
-`TikaInputStream` handles the spooling transparently:
+`TikaInputStream` handles the spooling transparently based on how it was initialized:
-* If the stream is already backed by a file, it returns that file directly.
-* If the stream is in-memory or network-based, it spools to a temporary file.
-* The temporary file is automatically cleaned up when the stream is closed.
+* **Initialized with `Path`**: The file is used directly for random access. No spooling needed.
+* **Initialized with `byte[]`**: The bytes are kept in memory. Spooling only on demand.
+* **Initialized with `InputStream`**: When `getPath()` or `getFile()` is called, the stream
+ is dynamically buffered to memory first, then spills to a temporary file after a threshold.
+ The temporary file is automatically cleaned up when the stream is closed.
==== Benefits of Decentralized Spooling
@@ -105,7 +87,7 @@ file management. This means:
=== Default Behavior
By default, Tika handles spooling automatically. You don't need to configure anything
-for most use cases. When a detector or parser needs random access to a file, it will
+for most use cases. When a detector or parser benefits from random access to a file, it will
spool the input stream to a temporary file if necessary.
=== SpoolingStrategy for Fine-Grained Control
@@ -113,7 +95,7 @@ spool the input stream to a temporary file if necessary.
For advanced use cases, you can use `SpoolingStrategy` to control spooling behavior.
This is useful when you want to:
-* Restrict which file types are allowed to spool (e.g., for security reasons)
+* Restrict which file types are allowed to spool (e.g., for performance reasons)
* Customize spooling behavior based on metadata or stream properties
==== Programmatic Configuration
@@ -211,8 +193,26 @@ context.set(SpoolingStrategy.class, strategy);
1. **Let Tika handle it**: For most applications, the default behavior is optimal.
Don't configure spooling unless you have a specific need.
-2. **Use TikaInputStream**: Always wrap your input streams with `TikaInputStream`
- to enable efficient spooling and rewind capabilities.
+2. **Use TikaInputStream with Path or byte[]**: When you have a file, pass the `Path`
+ directly to `TikaInputStream.get(Path)` rather than wrapping a `FileInputStream`.
+ Similarly, pass `byte[]` directly rather than wrapping a `ByteArrayInputStream`.
+ This allows TikaInputStream to use efficient backing strategies that avoid unnecessary
+ copying or spooling:
++
+[source,java]
+----
+// Good: TikaInputStream knows it has a file, can use random access directly
+TikaInputStream tis = TikaInputStream.get(path);
+
+// Bad: TikaInputStream sees an opaque stream, may spool unnecessarily
+TikaInputStream tis = TikaInputStream.get(new FileInputStream(file));
+
+// Good: TikaInputStream knows it has bytes in memory
+TikaInputStream tis = TikaInputStream.get(bytes);
+
+// Bad: TikaInputStream sees an opaque stream
+TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(bytes));
+----
3. **Close streams properly**: Use try-with-resources to ensure temporary files
are cleaned up:
@@ -225,5 +225,5 @@ try (TikaInputStream tis = TikaInputStream.get(inputStream)) {
----
4. **Consider memory vs. disk tradeoffs**: For very large files, spooling to disk
- is necessary. For small files processed in bulk, keeping data in memory may be
+ may be needed. For small files processed in bulk, keeping data in memory may be
faster. `TikaInputStream` backing strategies can be tuned for your workload.
diff --git a/tika-core/src/main/java/org/apache/tika/io/ByteArraySource.java b/tika-core/src/main/java/org/apache/tika/io/ByteArraySource.java
index 3d19a04a5d..a9dcd8da96 100644
--- a/tika-core/src/main/java/org/apache/tika/io/ByteArraySource.java
+++ b/tika-core/src/main/java/org/apache/tika/io/ByteArraySource.java
@@ -112,4 +112,30 @@ class ByteArraySource extends InputStream implements TikaInputSource {
public long getLength() {
return length;
}
+
+ @Override
+ public void enableRewind() {
+ // No-op: byte array is always rewindable
+ }
+
+ // Mark/reset support using position tracking
+ private int markPosition = -1;
+
+ @Override
+ public synchronized void mark(int readlimit) {
+ markPosition = position;
+ }
+
+ @Override
+ public synchronized void reset() throws IOException {
+ if (markPosition < 0) {
+ throw new IOException("Mark not set");
+ }
+ position = markPosition;
+ }
+
+ @Override
+ public boolean markSupported() {
+ return true;
+ }
}
diff --git a/tika-core/src/main/java/org/apache/tika/io/CachingInputStream.java b/tika-core/src/main/java/org/apache/tika/io/CachingInputStream.java
index 3cbd859f31..b9474b1cae 100644
--- a/tika-core/src/main/java/org/apache/tika/io/CachingInputStream.java
+++ b/tika-core/src/main/java/org/apache/tika/io/CachingInputStream.java
@@ -192,9 +192,24 @@ class CachingInputStream extends InputStream {
return source.available();
}
+ // Mark/reset support using seekTo
+ private long markPosition = -1;
+
+ @Override
+ public synchronized void mark(int readlimit) {
+ markPosition = position;
+ }
+
+ @Override
+ public synchronized void reset() throws IOException {
+ if (markPosition < 0) {
+ throw new IOException("Mark not set");
+ }
+ seekTo(markPosition);
+ }
+
@Override
public boolean markSupported() {
- // Mark/reset is handled at the TikaInputStream level
- return false;
+ return true;
}
}
diff --git a/tika-core/src/main/java/org/apache/tika/io/CachingSource.java b/tika-core/src/main/java/org/apache/tika/io/CachingSource.java
index baf38c7cd8..07c6f0fdc3 100644
--- a/tika-core/src/main/java/org/apache/tika/io/CachingSource.java
+++ b/tika-core/src/main/java/org/apache/tika/io/CachingSource.java
@@ -25,55 +25,94 @@ import java.nio.file.Path;
import org.apache.commons.io.IOUtils;
/**
- * Input source that caches bytes from a raw InputStream.
+ * Input source that wraps a raw InputStream with optional caching.
* <p>
- * Uses {@link CachingInputStream} to cache bytes as they are read,
- * enabling mark/reset/seek operations. If the cache exceeds a threshold,
- * it spills to a temporary file via {@link StreamCache}.
+ * Starts in passthrough mode using {@link BufferedInputStream} for basic
+ * mark/reset support. When {@link #enableRewind()} is called (at position 0),
+ * switches to caching mode using {@link CachingInputStream} which enables
+ * full rewind/seek capability.
+ * <p>
+ * If caching is not enabled, {@link #seekTo(long)} will fail for any position
+ * other than the current position.
*/
class CachingSource extends InputStream implements TikaInputSource {
private final TemporaryResources tmp;
- private CachingInputStream cachingStream;
private long length;
+ // Passthrough mode: just a BufferedInputStream
+ private BufferedInputStream passthroughStream;
+ private long passthroughPosition;
+
+ // Caching mode: CachingInputStream for full rewind support
+ private CachingInputStream cachingStream;
+
// After spilling to file, we switch to file-backed mode
private Path spilledPath;
private InputStream fileStream;
+ private long filePosition; // Track position in file mode
CachingSource(InputStream source, TemporaryResources tmp, long length) {
this.tmp = tmp;
this.length = length;
- StreamCache cache = new StreamCache(tmp);
- this.cachingStream = new CachingInputStream(
- source instanceof BufferedInputStream ? source : new BufferedInputStream(source),
- cache
- );
+ // Start in passthrough mode
+ this.passthroughStream = source instanceof BufferedInputStream
+ ? (BufferedInputStream) source
+ : new BufferedInputStream(source);
+ this.passthroughPosition = 0;
}
@Override
public int read() throws IOException {
if (fileStream != null) {
- return fileStream.read();
+ int b = fileStream.read();
+ if (b != -1) {
+ filePosition++;
+ }
+ return b;
}
- return cachingStream.read();
+ if (cachingStream != null) {
+ return cachingStream.read();
+ }
+ int b = passthroughStream.read();
+ if (b != -1) {
+ passthroughPosition++;
+ }
+ return b;
}
@Override
public int read(byte[] b, int off, int len) throws IOException {
if (fileStream != null) {
- return fileStream.read(b, off, len);
+ int n = fileStream.read(b, off, len);
+ if (n > 0) {
+ filePosition += n;
+ }
+ return n;
+ }
+ if (cachingStream != null) {
+ return cachingStream.read(b, off, len);
+ }
+ int n = passthroughStream.read(b, off, len);
+ if (n > 0) {
+ passthroughPosition += n;
}
- return cachingStream.read(b, off, len);
+ return n;
}
@Override
public long skip(long n) throws IOException {
if (fileStream != null) {
- return IOUtils.skip(fileStream, n);
+ long skipped = IOUtils.skip(fileStream, n);
+ filePosition += skipped;
+ return skipped;
}
- //this is safe because cachingStream already does a read instead of trusting the skip
- return cachingStream.skip(n);
+ if (cachingStream != null) {
+ return cachingStream.skip(n);
+ }
+ long skipped = IOUtils.skip(passthroughStream, n);
+ passthroughPosition += skipped;
+ return skipped;
}
@Override
@@ -81,7 +120,74 @@ class CachingSource extends InputStream implements TikaInputSource {
if (fileStream != null) {
return fileStream.available();
}
- return cachingStream.available();
+ if (cachingStream != null) {
+ return cachingStream.available();
+ }
+ return passthroughStream.available();
+ }
+
+ // Track mark position across all modes
+ private long markPosition = -1;
+
+ @Override
+ public synchronized void mark(int readlimit) {
+ if (fileStream != null) {
+ // File mode - track position for seekTo-based reset
+ markPosition = filePosition;
+ return;
+ }
+ if (cachingStream != null) {
+ // Caching mode - track position for seekTo-based reset
+ markPosition = cachingStream.getPosition();
+ return;
+ }
+ // Passthrough mode - delegate to BufferedInputStream
+ passthroughStream.mark(readlimit);
+ markPosition = passthroughPosition;
+ }
+
+ @Override
+ public synchronized void reset() throws IOException {
+ if (markPosition < 0) {
+ throw new IOException("Mark not set");
+ }
+ if (fileStream != null) {
+ // File mode - use seekTo
+ seekTo(markPosition);
+ return;
+ }
+ if (cachingStream != null) {
+ // Caching mode - use seekTo
+ cachingStream.seekTo(markPosition);
+ return;
+ }
+ // Passthrough mode - delegate to BufferedInputStream
+ passthroughStream.reset();
+ passthroughPosition = markPosition;
+ }
+
+ @Override
+ public boolean markSupported() {
+ return true;
+ }
+
+ @Override
+ public void enableRewind() {
+ // Already in caching or file mode - no-op
+ if (cachingStream != null || fileStream != null) {
+ return;
+ }
+
+ if (passthroughPosition != 0) {
+ throw new IllegalStateException(
+ "Cannot enable rewind: position is " + passthroughPosition +
+ ", must be 0. Call enableRewind() before reading.");
+ }
+
+ // Switch to caching mode
+ StreamCache cache = new StreamCache(tmp);
+ cachingStream = new CachingInputStream(passthroughStream, cache);
+ passthroughStream = null;
}
@Override
@@ -93,8 +199,20 @@ class CachingSource extends InputStream implements TikaInputSource {
if (position > 0) {
IOUtils.skipFully(fileStream, position);
}
- } else {
+ filePosition = position;
+ return;
+ }
+
+ if (cachingStream != null) {
cachingStream.seekTo(position);
+ return;
+ }
+
+ // Passthrough mode - can only "seek" to current position
+ if (position != passthroughPosition) {
+ throw new IOException(
+ "Cannot seek in passthrough mode. Call enableRewind() first. " +
+ "Current position: " + passthroughPosition + ", requested: " + position);
}
}
@@ -106,6 +224,16 @@ class CachingSource extends InputStream implements TikaInputSource {
@Override
public Path getPath(TemporaryResources tmp, String suffix) throws IOException {
if (spilledPath == null) {
+ // If still in passthrough mode, enable caching first
+ if (cachingStream == null) {
+ if (passthroughPosition != 0) {
+ throw new IOException(
+ "Cannot spill to file: position is " + passthroughPosition +
+ ", must be 0. Call enableRewind() before reading if you need file access.");
+ }
+ enableRewind();
+ }
+
// Spill to file and switch to file-backed mode
spilledPath = cachingStream.spillToFile(suffix);
@@ -120,6 +248,7 @@ class CachingSource extends InputStream implements TikaInputSource {
if (currentPosition > 0) {
IOUtils.skipFully(fileStream, currentPosition);
}
+ filePosition = currentPosition;
// Update length from file size
long fileSize = Files.size(spilledPath);
@@ -145,5 +274,8 @@ class CachingSource extends InputStream implements TikaInputSource {
if (cachingStream != null) {
cachingStream.close();
}
+ if (passthroughStream != null) {
+ passthroughStream.close();
+ }
}
}
diff --git a/tika-core/src/main/java/org/apache/tika/io/FileSource.java b/tika-core/src/main/java/org/apache/tika/io/FileSource.java
index e89690a086..95f6458574 100644
--- a/tika-core/src/main/java/org/apache/tika/io/FileSource.java
+++ b/tika-core/src/main/java/org/apache/tika/io/FileSource.java
@@ -118,4 +118,30 @@ class FileSource extends InputStream implements TikaInputSource {
currentStream.close();
}
}
+
+ @Override
+ public void enableRewind() {
+ // No-op: file is always rewindable
+ }
+
+ // Mark/reset support using seekTo
+ private long markPosition = -1;
+
+ @Override
+ public synchronized void mark(int readlimit) {
+ markPosition = position;
+ }
+
+ @Override
+ public synchronized void reset() throws IOException {
+ if (markPosition < 0) {
+ throw new IOException("Mark not set");
+ }
+ seekTo(markPosition);
+ }
+
+ @Override
+ public boolean markSupported() {
+ return true;
+ }
}
diff --git a/tika-core/src/main/java/org/apache/tika/io/TikaInputSource.java b/tika-core/src/main/java/org/apache/tika/io/TikaInputSource.java
index 10e2b52dd8..7a8da5d703 100644
--- a/tika-core/src/main/java/org/apache/tika/io/TikaInputSource.java
+++ b/tika-core/src/main/java/org/apache/tika/io/TikaInputSource.java
@@ -53,4 +53,17 @@ interface TikaInputSource extends Closeable {
* Returns the length of the content, or -1 if unknown.
*/
long getLength();
+
+ /**
+ * Enables full rewind capability.
+ * <p>
+ * For ByteArraySource and FileSource, this is a no-op (always rewindable).
+ * For CachingSource, this switches from passthrough mode to caching mode,
+ * enabling subsequent {@link #seekTo(long)} and rewind operations.
+ * <p>
+ * Must be called when position is 0, otherwise throws IllegalStateException.
+ *
+ * @throws IllegalStateException if position is not 0
+ */
+ void enableRewind();
}
diff --git a/tika-core/src/main/java/org/apache/tika/io/TikaInputStream.java b/tika-core/src/main/java/org/apache/tika/io/TikaInputStream.java
index c87a5ea09d..7eee791e3a 100644
--- a/tika-core/src/main/java/org/apache/tika/io/TikaInputStream.java
+++ b/tika-core/src/main/java/org/apache/tika/io/TikaInputStream.java
@@ -299,12 +299,8 @@ public class TikaInputStream extends TaggedInputStream {
throw new IOException("Resetting to invalid mark");
}
- TikaInputSource source = inputSource();
- if (source != null) {
- source.seekTo(mark);
- } else {
- throw new IOException("Cannot reset: no TikaInputSource available");
- }
+ // Delegate to underlying stream's reset (handles passthrough and caching modes)
+ super.reset();
position = mark;
// Don't invalidate mark - allow multiple reset() calls to same mark
@@ -447,11 +443,46 @@ public class TikaInputStream extends TaggedInputStream {
/**
* Rewind the stream to the beginning.
+ * <p>
+ * For streams created from byte arrays or files, this always works.
+ * For streams created from raw InputStreams, this requires
+ * {@link #enableRewind()} to have been called first.
*/
public void rewind() throws IOException {
- mark = 0;
- reset();
+ TikaInputSource source = inputSource();
+ if (source != null) {
+ source.seekTo(0);
+ } else {
+ throw new IOException("Cannot rewind: no TikaInputSource available");
+ }
+ position = 0;
mark = -1;
+ consecutiveEOFs = 0;
+ }
+
+ /**
+ * Enables full rewind capability for this stream.
+ * <p>
+ * For streams backed by byte arrays or files, this is a no-op since they
+ * are inherently rewindable. For streams backed by raw InputStreams, this
+ * switches from passthrough mode to caching mode, enabling subsequent
+ * {@link #rewind()}, {@link #mark(int)}/{@link #reset()}, and random access.
+ * <p>
+ * Must be called when position is 0 (before any reading), otherwise
+ * throws IllegalStateException.
+ * <p>
+ * Use this method when you know you'll need to rewind the stream later
+ * (e.g., for detection followed by parsing, or digest calculation).
+ * For streaming-only operations (e.g., HTML parsing), skip this call
+ * to avoid unnecessary caching overhead.
+ *
+ * @throws IllegalStateException if position is not 0
+ */
+ public void enableRewind() {
+ TikaInputSource source = inputSource();
+ if (source != null) {
+ source.enableRewind();
+ }
}
@Override
diff --git a/tika-core/src/main/java/org/apache/tika/parser/multiple/AbstractMultipleParser.java b/tika-core/src/main/java/org/apache/tika/parser/multiple/AbstractMultipleParser.java
index 2cf757db2b..17eff0515e 100644
--- a/tika-core/src/main/java/org/apache/tika/parser/multiple/AbstractMultipleParser.java
+++ b/tika-core/src/main/java/org/apache/tika/parser/multiple/AbstractMultipleParser.java
@@ -244,6 +244,9 @@ public abstract class AbstractMultipleParser implements Parser {
private void parse(TikaInputStream tis, ContentHandler handler,
ContentHandlerFactory handlerFactory, Metadata originalMetadata,
ParseContext context) throws IOException, SAXException, TikaException {
+ // Enable rewind capability since we rewind between multiple parsers
+ tis.enableRewind();
+
// Track the metadata between parsers, so we can apply our policy
Metadata lastMetadata = cloneMetadata(originalMetadata);
Metadata metadata = lastMetadata;
diff --git a/tika-core/src/test/java/org/apache/tika/io/TikaInputStreamTest.java b/tika-core/src/test/java/org/apache/tika/io/TikaInputStreamTest.java
index 6976a0bba4..bf8e0b7ba0 100644
--- a/tika-core/src/test/java/org/apache/tika/io/TikaInputStreamTest.java
+++ b/tika-core/src/test/java/org/apache/tika/io/TikaInputStreamTest.java
@@ -218,6 +218,8 @@ public class TikaInputStreamTest {
public void testGetPathPreservesPosition() throws IOException {
byte[] data = bytes("Hello, World!");
try (TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(data))) {
+ tis.enableRewind(); // Enable caching for getPath() support after reading
+
byte[] buf = new byte[5];
tis.read(buf);
assertEquals(5, tis.getPosition());
@@ -275,6 +277,8 @@ public class TikaInputStreamTest {
new Random(42).nextBytes(data);
try (TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(data))) {
+ tis.enableRewind(); // Enable caching for rewind support
+
byte[] buffer = new byte[data.length];
int totalRead = 0;
int n;
@@ -669,6 +673,158 @@ public class TikaInputStreamTest {
}
}
+ // ========== enableRewind() Tests ==========
+
+ @Test
+ public void testEnableRewindByteArrayNoOp() throws Exception {
+ // ByteArraySource is always rewindable - enableRewind() is no-op
+ byte[] data = bytes("Hello, World!");
+ try (TikaInputStream tis = TikaInputStream.get(data)) {
+ tis.enableRewind(); // Should be no-op
+
+ byte[] buf = new byte[5];
+ tis.read(buf);
+ assertEquals("Hello", str(buf));
+
+ tis.rewind();
+ assertEquals(0, tis.getPosition());
+
+ buf = new byte[5];
+ tis.read(buf);
+ assertEquals("Hello", str(buf));
+ }
+ }
+
+ @Test
+ public void testEnableRewindFileNoOp() throws Exception {
+ // FileSource is always rewindable - enableRewind() is no-op
+ Path tempFile = createTempFile("Hello, World!");
+ try (TikaInputStream tis = TikaInputStream.get(tempFile)) {
+ tis.enableRewind(); // Should be no-op
+
+ byte[] buf = new byte[5];
+ tis.read(buf);
+ assertEquals("Hello", str(buf));
+
+ tis.rewind();
+ assertEquals(0, tis.getPosition());
+
+ buf = new byte[5];
+ tis.read(buf);
+ assertEquals("Hello", str(buf));
+ }
+ }
+
+ @Test
+ public void testEnableRewindStreamEnablesCaching() throws Exception {
+ // CachingSource starts in passthrough mode, enableRewind() enables caching
+ byte[] data = bytes("Hello, World!");
+ try (TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(data))) {
+ tis.enableRewind(); // Enable caching mode
+
+ byte[] buf = new byte[5];
+ tis.read(buf);
+ assertEquals("Hello", str(buf));
+
+ tis.rewind();
+ assertEquals(0, tis.getPosition());
+
+ buf = new byte[5];
+ tis.read(buf);
+ assertEquals("Hello", str(buf));
+ }
+ }
+
+ @Test
+ public void testEnableRewindAfterReadThrows() throws Exception {
+ // enableRewind() must be called at position 0
+ byte[] data = bytes("Hello, World!");
+ try (TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(data))) {
+ tis.read(); // Read one byte, position is now 1
+ assertEquals(1, tis.getPosition());
+
+ assertThrows(IllegalStateException.class, tis::enableRewind,
+ "enableRewind() should throw when position != 0");
+ }
+ }
+
+ @Test
+ public void testEnableRewindMultipleCallsNoOp() throws Exception {
+ // Multiple enableRewind() calls should be safe (no-op after first)
+ byte[] data = bytes("Hello, World!");
+ try (TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(data))) {
+ tis.enableRewind();
+ tis.enableRewind(); // Should be no-op
+ tis.enableRewind(); // Should be no-op
+
+ byte[] buf = readAllBytes(tis);
+ assertEquals("Hello, World!", str(buf));
+
+ tis.rewind();
+ buf = readAllBytes(tis);
+ assertEquals("Hello, World!", str(buf));
+ }
+ }
+
+ @Test
+ public void testStreamWithoutEnableRewindCannotRewind() throws Exception {
+ // Without enableRewind(), CachingSource is in passthrough mode
+ // rewind() should fail after reading in passthrough mode
+ byte[] data = bytes("Hello, World!");
+ try (TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(data))) {
+ // Don't call enableRewind()
+
+ byte[] buf = new byte[5];
+ tis.read(buf);
+ assertEquals("Hello", str(buf));
+
+ // rewind() calls seekTo(0) on the underlying source
+ // In passthrough mode, seekTo() fails if not at current position
+ assertThrows(IOException.class, tis::rewind,
+ "rewind() should fail in passthrough mode after reading");
+ }
+ }
+
+ @Test
+ public void testMarkResetThenEnableRewind() throws Exception {
+ // Test transitioning from passthrough mode (using BufferedInputStream's mark/reset)
+ // to caching mode via enableRewind()
+ byte[] data = bytes("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
+ try (TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(data))) {
+ // Passthrough mode - use BufferedInputStream's mark/reset
+ tis.mark(100);
+ byte[] buf = new byte[5];
+ tis.read(buf);
+ assertEquals("ABCDE", str(buf));
+
+ tis.reset(); // Back to 0
+ assertEquals(0, tis.getPosition());
+
+ // Another mark/reset cycle in passthrough mode
+ tis.mark(100);
+ buf = new byte[10];
+ tis.read(buf);
+ assertEquals("ABCDEFGHIJ", str(buf));
+
+ tis.reset(); // Back to 0 again
+ assertEquals(0, tis.getPosition());
+
+ // Now enable rewind (switches to caching mode)
+ tis.enableRewind();
+
+ // Should still work with caching mode
+ buf = new byte[5];
+ tis.read(buf);
+ assertEquals("ABCDE", str(buf));
+
+ tis.rewind(); // Full rewind now works
+ assertEquals(0, tis.getPosition());
+
+ buf = readAllBytes(tis);
+ assertEquals("ABCDEFGHIJKLMNOPQRSTUVWXYZ", str(buf));
+ }
+ }
+
// ========== Helper Methods ==========
private TikaInputStream createTikaInputStream(byte[] data, boolean fileBacked) throws IOException {
@@ -684,7 +840,9 @@ public class TikaInputStreamTest {
Files.write(file, data);
return TikaInputStream.get(file);
case STREAM:
- return TikaInputStream.get(new ByteArrayInputStream(data));
+ TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(data));
+ tis.enableRewind(); // Enable caching for rewind support in tests
+ return tis;
default:
throw new IllegalArgumentException("Unknown backing type: " + backingType);
}
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-crypto-module/src/main/java/org/apache/tika/parser/crypto/TSDParser.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-crypto-module/src/main/java/org/apache/tika/parser/crypto/TSDParser.java
index 0337729955..bbbb6d3e01 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-crypto-module/src/main/java/org/apache/tika/parser/crypto/TSDParser.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-crypto-module/src/main/java/org/apache/tika/parser/crypto/TSDParser.java
@@ -97,6 +97,8 @@ public class TSDParser implements Parser {
@Override
public void parse(TikaInputStream tis, ContentHandler handler, Metadata metadata,
ParseContext context) throws IOException, SAXException, TikaException {
+ // Enable rewind capability since we read metadata then rewind to parse content
+ tis.enableRewind();
//Try to parse TSD file
Metadata TSDAndEmbeddedMetadata = new Metadata();
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
index f140989c1c..5b8aecbc0f 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
@@ -244,8 +244,8 @@ public class PackageParser extends AbstractEncodingDetectorParser {
public void parse(TikaInputStream tis, ContentHandler handler, Metadata metadata,
ParseContext context) throws IOException, SAXException, TikaException {
-
- // TikaInputStream always supports mark
+ // Enable rewind capability since we may need to re-read for 7z or data descriptor handling
+ tis.enableRewind();
TemporaryResources tmp = new TemporaryResources();
// Shield the TikaInputStream from being closed when we close archive streams.
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/zip/utils/ZipSalvager.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/zip/utils/ZipSalvager.java
index 75c11f0387..52391bbf8c 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/zip/utils/ZipSalvager.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/zip/utils/ZipSalvager.java
@@ -56,6 +56,8 @@ public class ZipSalvager {
boolean allowStoredEntries) throws IOException {
TikaInputStream tis = TikaInputStream.get(brokenZip);
+ // Enable rewind capability for retry on DATA_DESCRIPTOR feature
+ tis.enableRewind();
try {
try (ZipArchiveOutputStream outputStream = new ZipArchiveOutputStream(salvagedZip);
ZipArchiveInputStream zipArchiveInputStream = new ZipArchiveInputStream(