Re: RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v12]

Volker Simonis Fri, 22 May 2026 07:20:21 -0700

On Sat, 16 May 2026 07:21:09 GMT, Jaikiran Pai <[email protected]> wrote:


>> Can I please get a review of this doc and test-only change for 
>> https://bugs.openjdk.org/browse/JDK-8322256?
>> 
>> The `java.util.zip.GZIPInputStream` has been in the JDK since Java 1.1. 
>> However, its specification hasn't been clear on how it behaves, especially 
>> when a `InputStream` consists of more than one GZIP member. 
>> 
>> The commit in this PR updates the specification of this class to match its 
>> current (long standing) implementation.
>> 
>> A new jtreg test has been introduced to verify this behaviour.
>> 
>> I'll draft a CSR shortly.
>> 
>> ---------
>> - [x] I confirm that I make this contribution in accordance with the 
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>
> Jaikiran Pai has updated the pull request with a new target base due to a 
> merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains 19 additional 
> commits since the last revision:
> 
>  - Alan's review - update the class level doc
>  - merge latest from master branch
>  - merge latest from master branch
>  - merge latest from master branch
>  - minor adjustments
>  - Lance's review
>  - merge latest from master branch
>  - introduce test to verify read() throws IOException when invoked on closed 
> stream
>  - specify existing behaviour of throwing IOException when stream is already 
> closed
>  - merge latest from master branch
>  - ... and 9 more: https://git.openjdk.org/jdk/compare/b7022c4e...40a57962

I am late to this issue, but I think it is important to get this right and I'm 
not happy with the current solution. I think we all agree that according to RFC 
1952 specification, a GZIP stream can be composed of several, independent 
"members" (i.e. compressed data sets) and the `java.util.zip` package 
documentation has always clearly stated that it provides "classes for reading 
and writing the standard ZIP and GZIP file formats" according to RFC 1952.

Now we all know that the `GZIPInputStream` implementation was wrong and under 
some circumstances `GZIPInputStream::read()` could return early and ignore 
existing, concatenated "compressed members" 
([JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)). This problem has 
been fixed in JDK 23 and we have downported the fix to OpenJDK 17 and 21 
because we have seen a silent data loss in the 
[Ion](https://amazon-ion.github.io/ion-docs/) open source serialization library 
which is heavily used within Amazon.

Unfortunately, the fix for 
[JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144) introduced a 
performance regression for the common case of streams which only contain a 
single "compressed member", but that was fixed with 
[JDK-8374644](https://bugs.openjdk.org/browse/JDK-8374644). Then, another 
regression was reported under 
[JDK-8377896](https://bugs.openjdk.org/browse/JDK-8377896) and from my current 
understanding, that last regression was the reason for completely backing out 
the fix (i.e. [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)) of 
the initial problem as well as the fix for the regression (i.e. 
[JDK-8374644](https://bugs.openjdk.org/browse/JDK-8374644)) introduced by the 
primary fix.

When reading the description of 
[JDK-8377896](https://bugs.openjdk.org/browse/JDK-8377896) (i.e. the last 
regression report) I don't really think that this is an issue with 
`GZIPInputStream`'s support for multiple "compressed members", but rather an 
issue with a server which doesn't properly closes a connection after it has 
written the last byte to it. I think it is much more important, that a client 
can correctly read the **complete** payload it was sent rather then terminating 
the read operation quickly at the risk of missing some trailing data from the 
payload. The regression reported in 
[JDK-8377896](https://bugs.openjdk.org/browse/JDK-8377896) can be easily fixed 
on the server side by immediately closing the connection once the first 
"compressed member" was written to the connection *if and only if* the payload 
really consist of just a single "compressed member". However, if more 
"compressed members" will be written to that connection, it is right and the 
only correct implementa
 tion for `GZIPInputStream::read()` to block until the remaining "compressed 
members" become available as well.

In my opinion, the JBS issue for this PR (i.e. 
[JDK-8322256](https://bugs.openjdk.org/browse/JDK-8322256)) suggests one 
possible way to fix this issue:

- Add new method setEnableConcatenatedStreams(boolean), default true
- When concatenated streams disabled, stop after reading a stream trailer
- When concatenated streams enabled, throw ZipException if there is any data 
after a trailer but it cannot be successfully interpreted as a next header.

But instead of implementing that suggestion, this PR only updates the 
documentation, but still leaves the crucial part of this problem, namely the 
question whether `GZIPInputStream` supports multiple "compressed members" 
undefined. I would even argue, that with the proposed change, the situation 
becomes even more confusing, because first we claim that `GZIPInputStream` is 
fully RFC 1952 compatible and does support the reading of multiple "compressed 
members" but then, in a "hidden" `@implSpec` we say that the implementation 
uses `InputStream::available()` to check for additional, concatenated data 
members. I think the vast majority of readers of this `@implSpec` will not 
understand that the concrete implementation of `InputStream::available()` will 
affect `GZIPInputStream`'s ability to correctly read concatenated GZIP streams 
and in reality, users might not even be able to know how the concrete 
implementation of `InputStream::available()` looks like for the derived input 
stream imple
 mentation they are wrapping in their `GZIPInputStream`.

So while I think the suggestions from 
[JDK-8322256](https://bugs.openjdk.org/browse/JDK-8322256) posted above look 
interesting, they have the problem that they introduce a new API which can't be 
downported to earlier JDK versions. But I think the silent payload truncation 
in  `GZIPInputStream` is important to fix because it impacts real production 
code. I would therefor like to propose the following solution:

1. Leave the fix for [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144) 
and the follow-up performance fix 
[JDK-8374644](https://bugs.openjdk.org/browse/JDK-8374644) in the JDK (i.e. 
revert the backout [JDK-8381670](https://bugs.openjdk.org/browse/JDK-8381670)).
2. Instead of the the proposed `@implSpec`  in this PR, write a different one 
which explains that `GZIPInputStream::read()` will try to read beyond a GZIP 
members's trailer and that this read-ahead can block if the underlying input 
stream isn't properly closed (the other documentation changes look good to me).

Finally, to make a long story short, I think my main argument is that we should 
prioritize correctness (i.e. avoidance of silent payload truncation) over 
performance.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/30925#issuecomment-4519502667

Re: RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v12]

Reply via email to