ppkarwasz commented on code in PR #710:
URL: https://github.com/apache/commons-compress/pull/710#discussion_r2391175988


##########
src/main/java/org/apache/commons/compress/archivers/tar/TarFile.java:
##########
@@ -65,7 +66,7 @@ private final class BoundedTarEntryInputStream extends 
BoundedArchiveInputStream
         BoundedTarEntryInputStream(final TarArchiveEntry entry, final 
SeekableByteChannel channel) throws IOException {
             super(entry.getDataOffset(), entry.getRealSize());
             if (channel.size() - entry.getSize() < entry.getDataOffset()) {
-                throw new ArchiveException("Entry size exceeds archive size");
+                throw new EOFException("Truncated TAR archive: entry size 
exceeds archive size");

Review Comment:
   I see `EOFException` a bit differently: it’s not a “low-level” `IOException` 
but always a higher-level signal that fewer bytes were delivered than expected. 
That’s why I initially thought it fit here. However, I’m not strongly attached 
to using it, especially since some users (e.g. in socket scenarios) treat 
`EOFException` as a benign “end of stream”.
   
   What I do want, like you, is for users to be able to distinguish causes. My 
angle is slightly different: I’d like them to be able to tell **truncation** 
from **corruption**, rather than just archive-layer vs. I/O-layer. One way to 
make that distinction clearer would be to introduce a dedicated 
`TruncatedArchiveException` for archive-level truncation cases. This could 
subclass either `EOFException` (to align with the semantics of “not enough 
data”) or `ArchiveException`.
   
   > That one doesn't make sense to me. That means that before throwing an 
exception, we'd have to read all the way to the end of an input stream? That 
sounds like a DoS waiting to happen.
   
   This check applies to both special TAR entries (such as PAX headers) and 
regular file entries. For special entries that are consumed internally, eager 
validation is acceptable (and essentially neutral from a security perspective, 
since we already use the same code paths for both `InputStream` of unknown size 
and `SeekableByteChannel`). For regular file entries, however, the exception 
should be deferred so that **users** can read all data actually available 
before being told the entry is incomplete.
   
   Take a backup TAR written to a nearly full filesystem that gets truncated: 
ideally, the user should still be able to recover the intact portion of the 
last entry before being told it is incomplete. Throwing in the 
`BoundedTarEntryInputStream` constructor prevents this; surfacing the error 
lazily in `read()` preserves recoverability.
   
   There’s also a less common but illustrative case: a TAR archive being read 
while another process is appending to it. This is inherently racy, but users 
would reasonably expect the exception to occur only when the reader overtakes 
the writer.
   
   **TL;DR**
   
   * I’m fine with replacing `EOFException` with a custom exception, but in 
unit tests I’d still like to:
   
     * Distinguish **truncation** from **corruption** cases.
     * *(Nitpick)* Verify that the exception message clearly reports how many 
bytes were missing (e.g., `Integer.MAX_VALUE` in a PAX header suggests 
corruption instead).
   * I’d drop the eager exception from `BoundedTarEntryInputStream`.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to