[
https://issues.apache.org/jira/browse/COMPRESS-726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092163#comment-18092163
]
Tomas Illuminati commented on COMPRESS-726:
-------------------------------------------
[~pkarwasz]
Patch attached, I take the time to make a PoC of the Improve and exported from
git as a .patch. Applies on a clean master with git apply to test it.
[^COMPRESS-726.patch]
Both of your comments are in it. The absolute cap is composed on Commons IO
BoundedInputStream with a throwing setOnMaxCount, so it raises instead of
returning EOF ([~ggregory]). The ratio guard trips on the consumed bytes from
getCompressedCount(), with a configurable grace (Piotr).
The one thing I didn't do literally is the default. I left it off rather than a
default 100x. At the codec layer a default 100x throws on valid input, since
DEFLATE alone reaches about 1032x. memoryLimitInKb and BoundedInputStream's max
both default to -1, so off is the consistent posture here. The Tika
SecureContentHandler config (ratio 100 with a grace) is documented as the
recommended starting point. If you want it on at the factory, one line, and I
don't mind either way.
Shape. CompressorInputStream implements InputStreamStatistics with
getCompressedCount() defaulting to -1 (change 1).
BombGuardCompressorInputStream wraps the stream, and CompressorStreamFactory
wraps only when a limit is set, so unconfigured callers get the concrete stream
unchanged. Setters are setMaxDecompressedSize, setMaxCompressionRatio and
setCompressionRatioGraceBytes. Pack200 has no compressed count, so its cap
works (the guard counts output) but the ratio is skipped.
One finding worth its own ticket. getUncompressedCount() is not uniformly exact
today. FramedLZ4 and FramedSnappy feed count() the compressed bytes, so they
report the compressed size as uncompressed. The guard counts what it returns,
so it sidesteps this, but an in-base cap on getBytesRead() would have silently
missed both.
98 tests over the cap and ratio, the grace, the six formats, Pack200, the
factory wiring and a Tika-equivalent config on a real bomb. Full mvn green
before and after the patch, 4595 to 4693 tests, no regressions.
> Build decompression-bomb protection into CompressorInputStream
> --------------------------------------------------------------
>
> Key: COMPRESS-726
> URL: https://issues.apache.org/jira/browse/COMPRESS-726
> Project: Commons Compress
> Issue Type: Improvement
> Reporter: Piotr Karwasz
> Priority: Major
> Attachments: COMPRESS-726.patch
>
>
> Every concrete {{CompressorInputStream}} already tracks both the compressed
> bytes it has consumed and the uncompressed bytes it has emitted, through
> {{InputStreamStatistics}}. The per-format differences in how the compressed
> count is sourced (raw-input counter vs. internal bit/byte reader) amount only
> to a bounded read-ahead skew (at most one decompressor buffer), which is
> irrelevant for decompression-bomb detection.
> We could therefore rely on {{InputStreamStatistics}} to introduce
> decompression-bomb protection into each {{CompressorInputStream}}. Concretely:
> # Make {{CompressorInputStream}} implement {{InputStreamStatistics}}, so the
> compressed/uncompressed counts are part of the base type for every format.
> # Let users configure decompression-bomb protection parameters on the streams
> the builders / {{CompressorStreamFactory}} return, enforced centrally in the
> base class.
> h2. Background: InputStreamStatistics is sufficient
> {{getUncompressedCount()}} is uniformly exact: it is the base
> {{CompressorInputStream.getBytesRead()}} counter, fed by {{count(...)}} on
> every {{read}}, so it equals the number of bytes actually handed to the
> caller.
> {{getCompressedCount()}} is a live counter in every implementation. Some
> source it from a {{BoundedInputStream}} wrapping the raw input (so it
> includes a fixed amount of decompressor read-ahead), others from the format's
> own bit/byte reader (read-ahead corrected). Because the read-ahead is bounded
> by a constant buffer size, the ratio {{uncompressedCount / compressedCount}}
> converges to the true ratio as the stream grows, and the absolute output
> count is always exact. Both protection strategies below are robust to this
> skew once a small grace period has elapsed.
> h2. Proposed change 1: CompressorInputStream implements InputStreamStatistics
> The base class already defines {{getUncompressedCount()}}. Add
> {{getCompressedCount()}} and declare {{implements InputStreamStatistics}}.
> All concrete subclasses except {{Pack200CompressorInputStream}} already
> override {{getCompressedCount()}}, so the base declaration should provide a
> default (return {{-1}} meaning "unknown") rather than be {{abstract}}, to
> preserve source and binary compatibility for external subclasses. Subclasses
> that have a real count keep overriding it.
> h2. Proposed change 2: configurable bomb protection on the returned streams
> Expose two independent, composable limits, settable through the existing
> builders and through {{CompressorStreamFactory}} (which already carries
> {{memoryLimitInKb}}, so this is a natural companion):
> * *Absolute output cap*: throw once {{getUncompressedCount()}} exceeds a
> configured maximum. Exact and format independent.
> * *Mean ratio guard*: once output passes a configurable grace threshold,
> throw if {{getUncompressedCount() / getCompressedCount()}} exceeds a
> configured maximum ratio. The grace period tolerates legitimately high local
> ratios at the start (headers, dictionaries, long runs) while still enforcing
> a sane mean ratio.
> h2. Gap: Pack200CompressorInputStream
> {{Pack200CompressorInputStream}} is the one exception. It overrides
> {{read(...)}} to delegate straight to an internal bridge stream, never calls
> {{count(...)}} (its Javadoc states {{getBytesRead()}} always returns 0), and
> does not implement {{InputStreamStatistics}}. To bring it under the same
> protection it must route output through {{count(...)}} and provide a real
> {{getCompressedCount()}} (counting bytes pulled from the original input).
> Otherwise it should be documented as unprotected, with
> {{getCompressedCount()}} returning {{-1}} and the ratio guard skipped (the
> absolute cap also cannot fire while output is uncounted).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)