On Sat, 16 May 2026 07:21:09 GMT, Jaikiran Pai <[email protected]> wrote:
>> Can I please get a review of this doc and test-only change for >> https://bugs.openjdk.org/browse/JDK-8322256? >> >> The `java.util.zip.GZIPInputStream` has been in the JDK since Java 1.1. >> However, its specification hasn't been clear on how it behaves, especially >> when a `InputStream` consists of more than one GZIP member. >> >> The commit in this PR updates the specification of this class to match its >> current (long standing) implementation. >> >> A new jtreg test has been introduced to verify this behaviour. >> >> I'll draft a CSR shortly. >> >> --------- >> - [x] I confirm that I make this contribution in accordance with the >> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai). > > Jaikiran Pai has updated the pull request with a new target base due to a > merge or a rebase. The incremental webrev excludes the unrelated changes > brought in by the merge/rebase. The pull request contains 19 additional > commits since the last revision: > > - Alan's review - update the class level doc > - merge latest from master branch > - merge latest from master branch > - merge latest from master branch > - minor adjustments > - Lance's review > - merge latest from master branch > - introduce test to verify read() throws IOException when invoked on closed > stream > - specify existing behaviour of throwing IOException when stream is already > closed > - merge latest from master branch > - ... and 9 more: https://git.openjdk.org/jdk/compare/b7022c4e...40a57962 I am late to this issue, but I think it is important to get this right and I'm not happy with the current solution. I think we all agree that according to RFC 1952 specification, a GZIP stream can be composed of several, independent "members" (i.e. compressed data sets) and the `java.util.zip` package documentation has always clearly stated that it provides "classes for reading and writing the standard ZIP and GZIP file formats" according to RFC 1952. Now we all know that the `GZIPInputStream` implementation was wrong and under some circumstances `GZIPInputStream::read()` could return early and ignore existing, concatenated "compressed members" ([JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)). This problem has been fixed in JDK 23 and we have downported the fix to OpenJDK 17 and 21 because we have seen a silent data loss in the [Ion](https://amazon-ion.github.io/ion-docs/) open source serialization library which is heavily used within Amazon. Unfortunately, the fix for [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144) introduced a performance regression for the common case of streams which only contain a single "compressed member", but that was fixed with [JDK-8374644](https://bugs.openjdk.org/browse/JDK-8374644). Then, another regression was reported under [JDK-8377896](https://bugs.openjdk.org/browse/JDK-8377896) and from my current understanding, that last regression was the reason for completely backing out the fix (i.e. [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)) of the initial problem as well as the fix for the regression (i.e. [JDK-8374644](https://bugs.openjdk.org/browse/JDK-8374644)) introduced by the primary fix. When reading the description of [JDK-8377896](https://bugs.openjdk.org/browse/JDK-8377896) (i.e. the last regression report) I don't really think that this is an issue with `GZIPInputStream`'s support for multiple "compressed members", but rather an issue with a server which doesn't properly closes a connection after it has written the last byte to it. I think it is much more important, that a client can correctly read the **complete** payload it was sent rather then terminating the read operation quickly at the risk of missing some trailing data from the payload. The regression reported in [JDK-8377896](https://bugs.openjdk.org/browse/JDK-8377896) can be easily fixed on the server side by immediately closing the connection once the first "compressed member" was written to the connection *if and only if* the payload really consist of just a single "compressed member". However, if more "compressed members" will be written to that connection, it is right and the only correct implementa tion for `GZIPInputStream::read()` to block until the remaining "compressed members" become available as well. In my opinion, the JBS issue for this PR (i.e. [JDK-8322256](https://bugs.openjdk.org/browse/JDK-8322256)) suggests one possible way to fix this issue: - Add new method setEnableConcatenatedStreams(boolean), default true - When concatenated streams disabled, stop after reading a stream trailer - When concatenated streams enabled, throw ZipException if there is any data after a trailer but it cannot be successfully interpreted as a next header. But instead of implementing that suggestion, this PR only updates the documentation, but still leaves the crucial part of this problem, namely the question whether `GZIPInputStream` supports multiple "compressed members" undefined. I would even argue, that with the proposed change, the situation becomes even more confusing, because first we claim that `GZIPInputStream` is fully RFC 1952 compatible and does support the reading of multiple "compressed members" but then, in a "hidden" `@implSpec` we say that the implementation uses `InputStream::available()` to check for additional, concatenated data members. I think the vast majority of readers of this `@implSpec` will not understand that the concrete implementation of `InputStream::available()` will affect `GZIPInputStream`'s ability to correctly read concatenated GZIP streams and in reality, users might not even be able to know how the concrete implementation of `InputStream::available()` looks like for the derived input stream imple mentation they are wrapping in their `GZIPInputStream`. So while I think the suggestions from [JDK-8322256](https://bugs.openjdk.org/browse/JDK-8322256) posted above look interesting, they have the problem that they introduce a new API which can't be downported to earlier JDK versions. But I think the silent payload truncation in `GZIPInputStream` is important to fix because it impacts real production code. I would therefor like to propose the following solution: 1. Leave the fix for [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144) and the follow-up performance fix [JDK-8374644](https://bugs.openjdk.org/browse/JDK-8374644) in the JDK (i.e. revert the backout [JDK-8381670](https://bugs.openjdk.org/browse/JDK-8381670)). 2. Instead of the the proposed `@implSpec` in this PR, write a different one which explains that `GZIPInputStream::read()` will try to read beyond a GZIP members's trailer and that this read-ahead can block if the underlying input stream isn't properly closed (the other documentation changes look good to me). Finally, to make a long story short, I think my main argument is that we should prioritize correctness (i.e. avoidance of silent payload truncation) over performance. ------------- PR Comment: https://git.openjdk.org/jdk/pull/30925#issuecomment-4519502667
