ppkarwasz commented on code in PR #710:
URL: https://github.com/apache/commons-compress/pull/710#discussion_r2391175988
##########
src/main/java/org/apache/commons/compress/archivers/tar/TarFile.java:
##########
@@ -65,7 +66,7 @@ private final class BoundedTarEntryInputStream extends
BoundedArchiveInputStream
BoundedTarEntryInputStream(final TarArchiveEntry entry, final
SeekableByteChannel channel) throws IOException {
super(entry.getDataOffset(), entry.getRealSize());
if (channel.size() - entry.getSize() < entry.getDataOffset()) {
- throw new ArchiveException("Entry size exceeds archive size");
+ throw new EOFException("Truncated TAR archive: entry size
exceeds archive size");
Review Comment:
I see `EOFException` a bit differently: it’s not a “low-level” `IOException`
but always a higher-level signal that fewer bytes were delivered than expected.
That’s why I initially thought it fit here. However, I’m not strongly attached
to using it, especially since some users (e.g. in socket scenarios) treat
`EOFException` as a benign “end of stream”.
What I do want, like you, is for users to be able to distinguish causes. My
angle is slightly different: I’d like them to be able to tell **truncation**
from **corruption**, rather than just archive-layer vs. I/O-layer. One way to
make that distinction clearer would be to introduce a dedicated
`TruncatedArchiveException` for archive-level truncation cases. This could
subclass either `EOFException` (to align with the semantics of “not enough
data”) or `ArchiveException`.
> That one doesn't make sense to me. That means that before throwing an
exception, we'd have to read all the way to the end of an input stream? That
sounds like a DoS waiting to happen.
This check applies to both special TAR entries (such as PAX headers) and
regular file entries. For special entries that are consumed internally, eager
validation is acceptable (and essentially neutral from a security perspective,
since we already use the same code paths for both `InputStream` of unknown size
and `SeekableByteChannel`). For regular file entries, however, the exception
should be deferred so that **users** can read all data actually available
before being told the entry is incomplete.
Take a backup TAR written to a nearly full filesystem that gets truncated:
ideally, the user should still be able to recover the intact portion of the
last entry before being told it is incomplete. Throwing in the
`BoundedTarEntryInputStream` constructor prevents this; surfacing the error
lazily in `read()` preserves recoverability.
There’s also a less common but illustrative case: a TAR archive being read
while another process is appending to it. This is inherently racy, but users
would reasonably expect the exception to occur only when the reader overtakes
the writer.
**TL;DR**
* I’m fine with replacing `EOFException` with a custom exception, but in
unit tests I’d still like to:
* Distinguish **truncation** from **corruption** cases.
* *(Nitpick)* Verify that the exception message clearly reports how many
bytes were missing (e.g., `Integer.MAX_VALUE` in a PAX header suggests
corruption instead).
* I’d drop the eager exception from `BoundedTarEntryInputStream`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]