This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/main by this push:
     new 5f9a808ac3 TIKA-4623 -- for general updates, don't buffer unless enableRewind has been set (#2534)
5f9a808ac3 is described below

commit 5f9a808ac316ca09699484d207e66614ed7fef5e
Author: Tim Allison <[email protected]>
AuthorDate: Thu Jan 15 12:45:32 2026 -0500

    TIKA-4623 -- for general updates, don't buffer unless enableRewind has been set (#2534)
---
 docs/spooling.adoc                                 |  72 ++++-----
 .../java/org/apache/tika/io/ByteArraySource.java   |  26 ++++
 .../org/apache/tika/io/CachingInputStream.java     |  19 ++-
 .../java/org/apache/tika/io/CachingSource.java     | 170 ++++++++++++++++++---
 .../main/java/org/apache/tika/io/FileSource.java   |  26 ++++
 .../java/org/apache/tika/io/TikaInputSource.java   |  13 ++
 .../java/org/apache/tika/io/TikaInputStream.java   |  47 +++++-
 .../parser/multiple/AbstractMultipleParser.java    |   3 +
 .../org/apache/tika/io/TikaInputStreamTest.java    | 160 ++++++++++++++++++-
 .../org/apache/tika/parser/crypto/TSDParser.java   |   2 +
 .../org/apache/tika/parser/pkg/PackageParser.java  |   4 +-
 .../org/apache/tika/zip/utils/ZipSalvager.java     |   2 +
 12 files changed, 476 insertions(+), 68 deletions(-)

diff --git a/docs/spooling.adoc b/docs/spooling.adoc
index 3385fd23a4..81d3bb18e4 100644
--- a/docs/spooling.adoc
+++ b/docs/spooling.adoc
@@ -26,44 +26,24 @@
 === What is Spooling?
 
 Spooling refers to the process of writing an input stream to a temporary file on disk.
-This is necessary for certain file formats that require random access to the underlying
-bytes during detection or parsing.
+This benefits certain file formats that can be processed more efficiently with random access
+to the underlying bytes during detection or parsing.
 
 === Why Some Formats Benefit from Random Access
 
 Several file formats are most efficiently processed with random access vs streaming:
 
-* **OLE2 (Microsoft Office legacy formats)**: The POI library needs to read the file
+* **OLE2 (Microsoft Office legacy formats)**: The POI library benefits from reading the file
   as a random-access structure to navigate the OLE2 container.
-* **ZIP-based formats**: Container detection requires reading the ZIP central directory,
-  which is located at the end of the file.
-* **Binary Property Lists (bplist)**: Apple's binary plist format requires random access
+* **ZIP-based formats**: Container detection benefits from reading the ZIP central directory,
+  which is located at the end of the file. Parsing also benefits from random access.
+* **Binary Property Lists (bplist)**: Apple's binary plist format benefits from random access
   for efficient parsing.
-* **PDF**: While detection works via magic bytes, parsing requires random access for
+* **PDF**: While detection works via magic bytes, parsing benefits from random access for
   the PDF cross-reference table.
 
 === Architectural Decision: Decentralized Spooling
 
-==== The Problem with Centralized Spooling
-
-Earlier versions of Tika considered centralizing spooling decisions in `DefaultDetector`.
-The detector would check the detected media type and spool to disk before passing the
-stream to specialized detectors or parsers.
-
-This approach had several drawbacks:
-
-1. **Unnecessary spooling**: PDF files need spooling for _parsing_ but not for _detection_
-   (magic bytes suffice). Centralized detection-time spooling would spool PDFs unnecessarily
-   when only detecting.
-
-2. **Redundant logic**: Specialized detectors like `POIFSContainerDetector` and
-   `DefaultZipContainerDetector` already call `TikaInputStream.getFile()` or `getPath()`
-   when they need random access. They know best when spooling is required.
-
-3. **Coupling**: Centralized spooling couples the detector to knowledge about which
-   formats need random access, duplicating logic that already exists in specialized
-   components.
-
 ==== The Solution: Let Components Self-Spool
 
 The current architecture follows a simple principle: **each component that needs random
@@ -78,11 +58,13 @@ Path path = TikaInputStream.get(inputStream).getPath();
 File file = TikaInputStream.get(inputStream).getFile();
 ----
 
-`TikaInputStream` handles the spooling transparently:
+`TikaInputStream` handles the spooling transparently based on how it was initialized:
 
-* If the stream is already backed by a file, it returns that file directly.
-* If the stream is in-memory or network-based, it spools to a temporary file.
-* The temporary file is automatically cleaned up when the stream is closed.
+* **Initialized with `Path`**: The file is used directly for random access. No spooling needed.
+* **Initialized with `byte[]`**: The bytes are kept in memory. Spooling only on demand.
+* **Initialized with `InputStream`**: When `getPath()` or `getFile()` is called, the stream
+  is dynamically buffered to memory first, then spills to a temporary file after a threshold.
+  The temporary file is automatically cleaned up when the stream is closed.
 
 ==== Benefits of Decentralized Spooling
 
@@ -105,7 +87,7 @@ file management. This means:
 === Default Behavior
 
 By default, Tika handles spooling automatically. You don't need to configure anything
-for most use cases. When a detector or parser needs random access to a file, it will
+for most use cases. When a detector or parser benefits from random access to a file, it will
 spool the input stream to a temporary file if necessary.
 
 === SpoolingStrategy for Fine-Grained Control
@@ -113,7 +95,7 @@ spool the input stream to a temporary file if necessary.
 For advanced use cases, you can use `SpoolingStrategy` to control spooling behavior.
 This is useful when you want to:
 
-* Restrict which file types are allowed to spool (e.g., for security reasons)
+* Restrict which file types are allowed to spool (e.g., for performance reasons)
 * Customize spooling behavior based on metadata or stream properties
 
 ==== Programmatic Configuration
@@ -211,8 +193,26 @@ context.set(SpoolingStrategy.class, strategy);
 1. **Let Tika handle it**: For most applications, the default behavior is optimal.
    Don't configure spooling unless you have a specific need.
 
-2. **Use TikaInputStream**: Always wrap your input streams with `TikaInputStream`
-   to enable efficient spooling and rewind capabilities.
+2. **Use TikaInputStream with Path or byte[]**: When you have a file, pass the `Path`
+   directly to `TikaInputStream.get(Path)` rather than wrapping a `FileInputStream`.
+   Similarly, pass `byte[]` directly rather than wrapping a `ByteArrayInputStream`.
+   This allows TikaInputStream to use efficient backing strategies that avoid unnecessary
+   copying or spooling:
++
+[source,java]
+----
+// Good: TikaInputStream knows it has a file, can use random access directly
+TikaInputStream tis = TikaInputStream.get(path);
+
+// Bad: TikaInputStream sees an opaque stream, may spool unnecessarily
+TikaInputStream tis = TikaInputStream.get(new FileInputStream(file));
+
+// Good: TikaInputStream knows it has bytes in memory
+TikaInputStream tis = TikaInputStream.get(bytes);
+
+// Bad: TikaInputStream sees an opaque stream
+TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(bytes));
+----
 
 3. **Close streams properly**: Use try-with-resources to ensure temporary files
    are cleaned up:
@@ -225,5 +225,5 @@ try (TikaInputStream tis = TikaInputStream.get(inputStream)) {
 ----
 
 4. **Consider memory vs. disk tradeoffs**: For very large files, spooling to disk
-   is necessary. For small files processed in bulk, keeping data in memory may be
+   may be needed. For small files processed in bulk, keeping data in memory may be
    faster. `TikaInputStream` backing strategies can be tuned for your workload.
diff --git a/tika-core/src/main/java/org/apache/tika/io/ByteArraySource.java b/tika-core/src/main/java/org/apache/tika/io/ByteArraySource.java
index 3d19a04a5d..a9dcd8da96 100644
--- a/tika-core/src/main/java/org/apache/tika/io/ByteArraySource.java
+++ b/tika-core/src/main/java/org/apache/tika/io/ByteArraySource.java
@@ -112,4 +112,30 @@ class ByteArraySource extends InputStream implements TikaInputSource {
     public long getLength() {
         return length;
     }
+
+    @Override
+    public void enableRewind() {
+        // No-op: byte array is always rewindable
+    }
+
+    // Mark/reset support using position tracking
+    private int markPosition = -1;
+
+    @Override
+    public synchronized void mark(int readlimit) {
+        markPosition = position;
+    }
+
+    @Override
+    public synchronized void reset() throws IOException {
+        if (markPosition < 0) {
+            throw new IOException("Mark not set");
+        }
+        position = markPosition;
+    }
+
+    @Override
+    public boolean markSupported() {
+        return true;
+    }
 }
diff --git a/tika-core/src/main/java/org/apache/tika/io/CachingInputStream.java b/tika-core/src/main/java/org/apache/tika/io/CachingInputStream.java
index 3cbd859f31..b9474b1cae 100644
--- a/tika-core/src/main/java/org/apache/tika/io/CachingInputStream.java
+++ b/tika-core/src/main/java/org/apache/tika/io/CachingInputStream.java
@@ -192,9 +192,24 @@ class CachingInputStream extends InputStream {
         return source.available();
     }
 
+    // Mark/reset support using seekTo
+    private long markPosition = -1;
+
+    @Override
+    public synchronized void mark(int readlimit) {
+        markPosition = position;
+    }
+
+    @Override
+    public synchronized void reset() throws IOException {
+        if (markPosition < 0) {
+            throw new IOException("Mark not set");
+        }
+        seekTo(markPosition);
+    }
+
     @Override
     public boolean markSupported() {
-        // Mark/reset is handled at the TikaInputStream level
-        return false;
+        return true;
     }
 }
diff --git a/tika-core/src/main/java/org/apache/tika/io/CachingSource.java b/tika-core/src/main/java/org/apache/tika/io/CachingSource.java
index baf38c7cd8..07c6f0fdc3 100644
--- a/tika-core/src/main/java/org/apache/tika/io/CachingSource.java
+++ b/tika-core/src/main/java/org/apache/tika/io/CachingSource.java
@@ -25,55 +25,94 @@ import java.nio.file.Path;
 import org.apache.commons.io.IOUtils;
 
 /**
- * Input source that caches bytes from a raw InputStream.
+ * Input source that wraps a raw InputStream with optional caching.
  * <p>
- * Uses {@link CachingInputStream} to cache bytes as they are read,
- * enabling mark/reset/seek operations. If the cache exceeds a threshold,
- * it spills to a temporary file via {@link StreamCache}.
+ * Starts in passthrough mode using {@link BufferedInputStream} for basic
+ * mark/reset support. When {@link #enableRewind()} is called (at position 0),
+ * switches to caching mode using {@link CachingInputStream} which enables
+ * full rewind/seek capability.
+ * <p>
+ * If caching is not enabled, {@link #seekTo(long)} will fail for any position
+ * other than the current position.
  */
 class CachingSource extends InputStream implements TikaInputSource {
 
     private final TemporaryResources tmp;
-    private CachingInputStream cachingStream;
     private long length;
 
+    // Passthrough mode: just a BufferedInputStream
+    private BufferedInputStream passthroughStream;
+    private long passthroughPosition;
+
+    // Caching mode: CachingInputStream for full rewind support
+    private CachingInputStream cachingStream;
+
     // After spilling to file, we switch to file-backed mode
     private Path spilledPath;
     private InputStream fileStream;
+    private long filePosition;  // Track position in file mode
 
     CachingSource(InputStream source, TemporaryResources tmp, long length) {
         this.tmp = tmp;
         this.length = length;
-        StreamCache cache = new StreamCache(tmp);
-        this.cachingStream = new CachingInputStream(
-                source instanceof BufferedInputStream ? source : new BufferedInputStream(source),
-                cache
-        );
+        // Start in passthrough mode
+        this.passthroughStream = source instanceof BufferedInputStream
+                ? (BufferedInputStream) source
+                : new BufferedInputStream(source);
+        this.passthroughPosition = 0;
     }
 
     @Override
     public int read() throws IOException {
         if (fileStream != null) {
-            return fileStream.read();
+            int b = fileStream.read();
+            if (b != -1) {
+                filePosition++;
+            }
+            return b;
         }
-        return cachingStream.read();
+        if (cachingStream != null) {
+            return cachingStream.read();
+        }
+        int b = passthroughStream.read();
+        if (b != -1) {
+            passthroughPosition++;
+        }
+        return b;
     }
 
     @Override
     public int read(byte[] b, int off, int len) throws IOException {
         if (fileStream != null) {
-            return fileStream.read(b, off, len);
+            int n = fileStream.read(b, off, len);
+            if (n > 0) {
+                filePosition += n;
+            }
+            return n;
+        }
+        if (cachingStream != null) {
+            return cachingStream.read(b, off, len);
+        }
+        int n = passthroughStream.read(b, off, len);
+        if (n > 0) {
+            passthroughPosition += n;
         }
-        return cachingStream.read(b, off, len);
+        return n;
     }
 
     @Override
     public long skip(long n) throws IOException {
         if (fileStream != null) {
-            return IOUtils.skip(fileStream, n);
+            long skipped = IOUtils.skip(fileStream, n);
+            filePosition += skipped;
+            return skipped;
         }
-        //this is safe because cachingStream already does a read instead of trusting the skip
-        return cachingStream.skip(n);
+        if (cachingStream != null) {
+            return cachingStream.skip(n);
+        }
+        long skipped = IOUtils.skip(passthroughStream, n);
+        passthroughPosition += skipped;
+        return skipped;
     }
 
     @Override
@@ -81,7 +120,74 @@ class CachingSource extends InputStream implements TikaInputSource {
         if (fileStream != null) {
             return fileStream.available();
         }
-        return cachingStream.available();
+        if (cachingStream != null) {
+            return cachingStream.available();
+        }
+        return passthroughStream.available();
+    }
+
+    // Track mark position across all modes
+    private long markPosition = -1;
+
+    @Override
+    public synchronized void mark(int readlimit) {
+        if (fileStream != null) {
+            // File mode - track position for seekTo-based reset
+            markPosition = filePosition;
+            return;
+        }
+        if (cachingStream != null) {
+            // Caching mode - track position for seekTo-based reset
+            markPosition = cachingStream.getPosition();
+            return;
+        }
+        // Passthrough mode - delegate to BufferedInputStream
+        passthroughStream.mark(readlimit);
+        markPosition = passthroughPosition;
+    }
+
+    @Override
+    public synchronized void reset() throws IOException {
+        if (markPosition < 0) {
+            throw new IOException("Mark not set");
+        }
+        if (fileStream != null) {
+            // File mode - use seekTo
+            seekTo(markPosition);
+            return;
+        }
+        if (cachingStream != null) {
+            // Caching mode - use seekTo
+            cachingStream.seekTo(markPosition);
+            return;
+        }
+        // Passthrough mode - delegate to BufferedInputStream
+        passthroughStream.reset();
+        passthroughPosition = markPosition;
+    }
+
+    @Override
+    public boolean markSupported() {
+        return true;
+    }
+
+    @Override
+    public void enableRewind() {
+        // Already in caching or file mode - no-op
+        if (cachingStream != null || fileStream != null) {
+            return;
+        }
+
+        if (passthroughPosition != 0) {
+            throw new IllegalStateException(
+                    "Cannot enable rewind: position is " + passthroughPosition +
+                            ", must be 0. Call enableRewind() before reading.");
+        }
+
+        // Switch to caching mode
+        StreamCache cache = new StreamCache(tmp);
+        cachingStream = new CachingInputStream(passthroughStream, cache);
+        passthroughStream = null;
     }
 
     @Override
@@ -93,8 +199,20 @@ class CachingSource extends InputStream implements TikaInputSource {
             if (position > 0) {
                 IOUtils.skipFully(fileStream, position);
             }
-        } else {
+            filePosition = position;
+            return;
+        }
+
+        if (cachingStream != null) {
             cachingStream.seekTo(position);
+            return;
+        }
+
+        // Passthrough mode - can only "seek" to current position
+        if (position != passthroughPosition) {
+            throw new IOException(
+                    "Cannot seek in passthrough mode. Call enableRewind() first. " +
+                            "Current position: " + passthroughPosition + ", requested: " + position);
         }
     }
 
@@ -106,6 +224,16 @@ class CachingSource extends InputStream implements TikaInputSource {
     @Override
     public Path getPath(TemporaryResources tmp, String suffix) throws IOException {
         if (spilledPath == null) {
+            // If still in passthrough mode, enable caching first
+            if (cachingStream == null) {
+                if (passthroughPosition != 0) {
+                    throw new IOException(
+                            "Cannot spill to file: position is " + passthroughPosition +
+                                    ", must be 0. Call enableRewind() before reading if you need file access.");
+                }
+                enableRewind();
+            }
+
             // Spill to file and switch to file-backed mode
             spilledPath = cachingStream.spillToFile(suffix);
 
@@ -120,6 +248,7 @@ class CachingSource extends InputStream implements TikaInputSource {
             if (currentPosition > 0) {
                 IOUtils.skipFully(fileStream, currentPosition);
             }
+            filePosition = currentPosition;
 
             // Update length from file size
             long fileSize = Files.size(spilledPath);
@@ -145,5 +274,8 @@ class CachingSource extends InputStream implements TikaInputSource {
         if (cachingStream != null) {
             cachingStream.close();
         }
+        if (passthroughStream != null) {
+            passthroughStream.close();
+        }
     }
 }
diff --git a/tika-core/src/main/java/org/apache/tika/io/FileSource.java b/tika-core/src/main/java/org/apache/tika/io/FileSource.java
index e89690a086..95f6458574 100644
--- a/tika-core/src/main/java/org/apache/tika/io/FileSource.java
+++ b/tika-core/src/main/java/org/apache/tika/io/FileSource.java
@@ -118,4 +118,30 @@ class FileSource extends InputStream implements TikaInputSource {
             currentStream.close();
         }
     }
+
+    @Override
+    public void enableRewind() {
+        // No-op: file is always rewindable
+    }
+
+    // Mark/reset support using seekTo
+    private long markPosition = -1;
+
+    @Override
+    public synchronized void mark(int readlimit) {
+        markPosition = position;
+    }
+
+    @Override
+    public synchronized void reset() throws IOException {
+        if (markPosition < 0) {
+            throw new IOException("Mark not set");
+        }
+        seekTo(markPosition);
+    }
+
+    @Override
+    public boolean markSupported() {
+        return true;
+    }
 }
diff --git a/tika-core/src/main/java/org/apache/tika/io/TikaInputSource.java b/tika-core/src/main/java/org/apache/tika/io/TikaInputSource.java
index 10e2b52dd8..7a8da5d703 100644
--- a/tika-core/src/main/java/org/apache/tika/io/TikaInputSource.java
+++ b/tika-core/src/main/java/org/apache/tika/io/TikaInputSource.java
@@ -53,4 +53,17 @@ interface TikaInputSource extends Closeable {
      * Returns the length of the content, or -1 if unknown.
      */
     long getLength();
+
+    /**
+     * Enables full rewind capability.
+     * <p>
+     * For ByteArraySource and FileSource, this is a no-op (always rewindable).
+     * For CachingSource, this switches from passthrough mode to caching mode,
+     * enabling subsequent {@link #seekTo(long)} and rewind operations.
+     * <p>
+     * Must be called when position is 0, otherwise throws IllegalStateException.
+     *
+     * @throws IllegalStateException if position is not 0
+     */
+    void enableRewind();
 }
diff --git a/tika-core/src/main/java/org/apache/tika/io/TikaInputStream.java b/tika-core/src/main/java/org/apache/tika/io/TikaInputStream.java
index c87a5ea09d..7eee791e3a 100644
--- a/tika-core/src/main/java/org/apache/tika/io/TikaInputStream.java
+++ b/tika-core/src/main/java/org/apache/tika/io/TikaInputStream.java
@@ -299,12 +299,8 @@ public class TikaInputStream extends TaggedInputStream {
             throw new IOException("Resetting to invalid mark");
         }
 
-        TikaInputSource source = inputSource();
-        if (source != null) {
-            source.seekTo(mark);
-        } else {
-            throw new IOException("Cannot reset: no TikaInputSource available");
-        }
+        // Delegate to underlying stream's reset (handles passthrough and caching modes)
+        super.reset();
 
         position = mark;
         // Don't invalidate mark - allow multiple reset() calls to same mark
@@ -447,11 +443,46 @@ public class TikaInputStream extends TaggedInputStream {
 
     /**
      * Rewind the stream to the beginning.
+     * <p>
+     * For streams created from byte arrays or files, this always works.
+     * For streams created from raw InputStreams, this requires
+     * {@link #enableRewind()} to have been called first.
      */
     public void rewind() throws IOException {
-        mark = 0;
-        reset();
+        TikaInputSource source = inputSource();
+        if (source != null) {
+            source.seekTo(0);
+        } else {
+            throw new IOException("Cannot rewind: no TikaInputSource available");
+        }
+        position = 0;
         mark = -1;
+        consecutiveEOFs = 0;
+    }
+
+    /**
+     * Enables full rewind capability for this stream.
+     * <p>
+     * For streams backed by byte arrays or files, this is a no-op since they
+     * are inherently rewindable. For streams backed by raw InputStreams, this
+     * switches from passthrough mode to caching mode, enabling subsequent
+     * {@link #rewind()}, {@link #mark(int)}/{@link #reset()}, and random access.
+     * <p>
+     * Must be called when position is 0 (before any reading), otherwise
+     * throws IllegalStateException.
+     * <p>
+     * Use this method when you know you'll need to rewind the stream later
+     * (e.g., for detection followed by parsing, or digest calculation).
+     * For streaming-only operations (e.g., HTML parsing), skip this call
+     * to avoid unnecessary caching overhead.
+     *
+     * @throws IllegalStateException if position is not 0
+     */
+    public void enableRewind() {
+        TikaInputSource source = inputSource();
+        if (source != null) {
+            source.enableRewind();
+        }
     }
 
     @Override
diff --git a/tika-core/src/main/java/org/apache/tika/parser/multiple/AbstractMultipleParser.java b/tika-core/src/main/java/org/apache/tika/parser/multiple/AbstractMultipleParser.java
index 2cf757db2b..17eff0515e 100644
--- a/tika-core/src/main/java/org/apache/tika/parser/multiple/AbstractMultipleParser.java
+++ b/tika-core/src/main/java/org/apache/tika/parser/multiple/AbstractMultipleParser.java
@@ -244,6 +244,9 @@ public abstract class AbstractMultipleParser implements Parser {
     private void parse(TikaInputStream tis, ContentHandler handler,
                        ContentHandlerFactory handlerFactory, Metadata originalMetadata,
                        ParseContext context) throws IOException, SAXException, TikaException {
+        // Enable rewind capability since we rewind between multiple parsers
+        tis.enableRewind();
+
         // Track the metadata between parsers, so we can apply our policy
         Metadata lastMetadata = cloneMetadata(originalMetadata);
         Metadata metadata = lastMetadata;
diff --git a/tika-core/src/test/java/org/apache/tika/io/TikaInputStreamTest.java b/tika-core/src/test/java/org/apache/tika/io/TikaInputStreamTest.java
index 6976a0bba4..bf8e0b7ba0 100644
--- a/tika-core/src/test/java/org/apache/tika/io/TikaInputStreamTest.java
+++ b/tika-core/src/test/java/org/apache/tika/io/TikaInputStreamTest.java
@@ -218,6 +218,8 @@ public class TikaInputStreamTest {
     public void testGetPathPreservesPosition() throws IOException {
         byte[] data = bytes("Hello, World!");
         try (TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(data))) {
+            tis.enableRewind(); // Enable caching for getPath() support after reading
+
             byte[] buf = new byte[5];
             tis.read(buf);
             assertEquals(5, tis.getPosition());
@@ -275,6 +277,8 @@ public class TikaInputStreamTest {
         new Random(42).nextBytes(data);
 
         try (TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(data))) {
+            tis.enableRewind(); // Enable caching for rewind support
+
             byte[] buffer = new byte[data.length];
             int totalRead = 0;
             int n;
@@ -669,6 +673,158 @@ public class TikaInputStreamTest {
         }
     }
 
+    // ========== enableRewind() Tests ==========
+
+    @Test
+    public void testEnableRewindByteArrayNoOp() throws Exception {
+        // ByteArraySource is always rewindable - enableRewind() is no-op
+        byte[] data = bytes("Hello, World!");
+        try (TikaInputStream tis = TikaInputStream.get(data)) {
+            tis.enableRewind(); // Should be no-op
+
+            byte[] buf = new byte[5];
+            tis.read(buf);
+            assertEquals("Hello", str(buf));
+
+            tis.rewind();
+            assertEquals(0, tis.getPosition());
+
+            buf = new byte[5];
+            tis.read(buf);
+            assertEquals("Hello", str(buf));
+        }
+    }
+
+    @Test
+    public void testEnableRewindFileNoOp() throws Exception {
+        // FileSource is always rewindable - enableRewind() is no-op
+        Path tempFile = createTempFile("Hello, World!");
+        try (TikaInputStream tis = TikaInputStream.get(tempFile)) {
+            tis.enableRewind(); // Should be no-op
+
+            byte[] buf = new byte[5];
+            tis.read(buf);
+            assertEquals("Hello", str(buf));
+
+            tis.rewind();
+            assertEquals(0, tis.getPosition());
+
+            buf = new byte[5];
+            tis.read(buf);
+            assertEquals("Hello", str(buf));
+        }
+    }
+
+    @Test
+    public void testEnableRewindStreamEnablesCaching() throws Exception {
+        // CachingSource starts in passthrough mode, enableRewind() enables caching
+        byte[] data = bytes("Hello, World!");
+        try (TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(data))) {
+            tis.enableRewind(); // Enable caching mode
+
+            byte[] buf = new byte[5];
+            tis.read(buf);
+            assertEquals("Hello", str(buf));
+
+            tis.rewind();
+            assertEquals(0, tis.getPosition());
+
+            buf = new byte[5];
+            tis.read(buf);
+            assertEquals("Hello", str(buf));
+        }
+    }
+
+    @Test
+    public void testEnableRewindAfterReadThrows() throws Exception {
+        // enableRewind() must be called at position 0
+        byte[] data = bytes("Hello, World!");
+        try (TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(data))) {
+            tis.read(); // Read one byte, position is now 1
+            assertEquals(1, tis.getPosition());
+
+            assertThrows(IllegalStateException.class, tis::enableRewind,
+                    "enableRewind() should throw when position != 0");
+        }
+    }
+
+    @Test
+    public void testEnableRewindMultipleCallsNoOp() throws Exception {
+        // Multiple enableRewind() calls should be safe (no-op after first)
+        byte[] data = bytes("Hello, World!");
+        try (TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(data))) {
+            tis.enableRewind();
+            tis.enableRewind(); // Should be no-op
+            tis.enableRewind(); // Should be no-op
+
+            byte[] buf = readAllBytes(tis);
+            assertEquals("Hello, World!", str(buf));
+
+            tis.rewind();
+            buf = readAllBytes(tis);
+            assertEquals("Hello, World!", str(buf));
+        }
+    }
+
+    @Test
+    public void testStreamWithoutEnableRewindCannotRewind() throws Exception {
+        // Without enableRewind(), CachingSource is in passthrough mode
+        // rewind() should fail after reading in passthrough mode
+        byte[] data = bytes("Hello, World!");
+        try (TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(data))) {
+            // Don't call enableRewind()
+
+            byte[] buf = new byte[5];
+            tis.read(buf);
+            assertEquals("Hello", str(buf));
+
+            // rewind() internally calls reset() which calls seekTo()
+            // In passthrough mode, seekTo() fails if not at current position
+            assertThrows(IOException.class, tis::rewind,
+                    "rewind() should fail in passthrough mode after reading");
+        }
+    }
+
+    @Test
+    public void testMarkResetThenEnableRewind() throws Exception {
+        // Test transitioning from passthrough mode (using BufferedInputStream's mark/reset)
+        // to caching mode via enableRewind()
+        byte[] data = bytes("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
+        try (TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(data))) {
+            // Passthrough mode - use BufferedInputStream's mark/reset
+            tis.mark(100);
+            byte[] buf = new byte[5];
+            tis.read(buf);
+            assertEquals("ABCDE", str(buf));
+
+            tis.reset();  // Back to 0
+            assertEquals(0, tis.getPosition());
+
+            // Another mark/reset cycle in passthrough mode
+            tis.mark(100);
+            buf = new byte[10];
+            tis.read(buf);
+            assertEquals("ABCDEFGHIJ", str(buf));
+
+            tis.reset();  // Back to 0 again
+            assertEquals(0, tis.getPosition());
+
+            // Now enable rewind (switches to caching mode)
+            tis.enableRewind();
+
+            // Should still work with caching mode
+            buf = new byte[5];
+            tis.read(buf);
+            assertEquals("ABCDE", str(buf));
+
+            tis.rewind();  // Full rewind now works
+            assertEquals(0, tis.getPosition());
+
+            buf = readAllBytes(tis);
+            assertEquals("ABCDEFGHIJKLMNOPQRSTUVWXYZ", str(buf));
+        }
+    }
+
     // ========== Helper Methods ==========
 
     private TikaInputStream createTikaInputStream(byte[] data, boolean fileBacked) throws IOException {
@@ -684,7 +840,9 @@ public class TikaInputStreamTest {
                 Files.write(file, data);
                 return TikaInputStream.get(file);
             case STREAM:
-                return TikaInputStream.get(new ByteArrayInputStream(data));
+                TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(data));
+                tis.enableRewind(); // Enable caching for rewind support in tests
+                return tis;
             default:
                 throw new IllegalArgumentException("Unknown backing type: " + backingType);
         }
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-crypto-module/src/main/java/org/apache/tika/parser/crypto/TSDParser.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-crypto-module/src/main/java/org/apache/tika/parser/crypto/TSDParser.java
index 0337729955..bbbb6d3e01 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-crypto-module/src/main/java/org/apache/tika/parser/crypto/TSDParser.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-crypto-module/src/main/java/org/apache/tika/parser/crypto/TSDParser.java
@@ -97,6 +97,8 @@ public class TSDParser implements Parser {
     @Override
     public void parse(TikaInputStream tis, ContentHandler handler, Metadata metadata,
                       ParseContext context) throws IOException, SAXException, TikaException {
+        // Enable rewind capability since we read metadata then rewind to parse content
+        tis.enableRewind();
 
         //Try to parse TSD file
         Metadata TSDAndEmbeddedMetadata = new Metadata();
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
index f140989c1c..5b8aecbc0f 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
@@ -244,8 +244,8 @@ public class PackageParser extends AbstractEncodingDetectorParser {
 
     public void parse(TikaInputStream tis, ContentHandler handler, Metadata metadata,
                       ParseContext context) throws IOException, SAXException, TikaException {
-
-        // TikaInputStream always supports mark
+        // Enable rewind capability since we may need to re-read for 7z or data descriptor handling
+        tis.enableRewind();
 
         TemporaryResources tmp = new TemporaryResources();
         // Shield the TikaInputStream from being closed when we close archive streams.
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/zip/utils/ZipSalvager.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/zip/utils/ZipSalvager.java
index 75c11f0387..52391bbf8c 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/zip/utils/ZipSalvager.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/zip/utils/ZipSalvager.java
@@ -56,6 +56,8 @@ public class ZipSalvager {
                                    boolean allowStoredEntries) throws IOException {
 
         TikaInputStream tis = TikaInputStream.get(brokenZip);
+        // Enable rewind capability for retry on DATA_DESCRIPTOR feature
+        tis.enableRewind();
         try {
             try (ZipArchiveOutputStream outputStream = new ZipArchiveOutputStream(salvagedZip);
                     ZipArchiveInputStream zipArchiveInputStream = new ZipArchiveInputStream(
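
Illustrative note (not part of the commit): the passthrough-to-caching switch that this change introduces in CachingSource can be sketched with plain java.io classes. The class below, RewindableStream, is a hypothetical stand-in written for this note. It is not Tika code, it keeps the cache in memory only (no spill to a temporary file), and it does not model mark/reset or getPath(). It only shows the contract described in the diff: enableRewind() must be called at position 0, and rewind() fails after reading if it was never called.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a passthrough stream that can opt in to caching. */
class RewindableStream extends InputStream {
    private final InputStream in;
    private ByteArrayOutputStream cache; // null while in passthrough mode
    private byte[] replay;               // snapshot of the cache being replayed after rewind()
    private int replayPos;
    private long position;

    RewindableStream(InputStream source) {
        this.in = new BufferedInputStream(source);
    }

    /** Switches to caching mode; only legal before any byte has been read. */
    void enableRewind() {
        if (cache != null) {
            return; // already caching, repeated calls are a no-op
        }
        if (position != 0) {
            throw new IllegalStateException("position is " + position + ", must be 0");
        }
        cache = new ByteArrayOutputStream();
    }

    @Override
    public int read() throws IOException {
        int b;
        if (replay != null && replayPos < replay.length) {
            b = replay[replayPos++] & 0xFF;  // serve bytes that were cached earlier
        } else {
            b = in.read();                   // read fresh bytes from the source
            if (b != -1 && cache != null) {
                cache.write(b);              // remember them only in caching mode
            }
        }
        if (b != -1) {
            position++;
        }
        return b;
    }

    /** Returns to position 0; requires caching mode once bytes have been read. */
    void rewind() throws IOException {
        if (cache == null) {
            if (position != 0) {
                throw new IOException("Cannot rewind in passthrough mode");
            }
            return;
        }
        replay = cache.toByteArray();
        replayPos = 0;
        position = 0;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "Hello, World!".getBytes(StandardCharsets.UTF_8);
        try (RewindableStream s = new RewindableStream(new ByteArrayInputStream(data))) {
            s.enableRewind();
            String head = new String(s.readNBytes(5), StandardCharsets.UTF_8);
            s.rewind();
            String all = new String(s.readAllBytes(), StandardCharsets.UTF_8);
            System.out.println(head + " | " + all);
        }
    }
}
```

The point of the design is that callers who only stream (for example, a single forward pass) never pay for the cache, while callers who know they will re-read opt in up front, as the parsers touched by this commit do with tis.enableRewind().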

