stevedlawrence commented on code in PR #227:
URL: https://github.com/apache/daffodil-site/pull/227#discussion_r3323927284
##########
site/layers.md:
##########
@@ -608,15 +608,40 @@ The entire fixed length region of the data will be pulled
into a byte buffer in
- Name: gzip
- Namespace URI: urn:org.apache.daffodil.layers.gzip
-- Parameter Variables: None
+- Parameter Variables
+ - `compressionLevel` an integer specifying the gzip compression level used
+ when unparsing. Valid values are:
+ - `0` (`NO_COMPRESSION`) — no compression; data is stored as-is in
+ gzip-formatted blocks.
+ - `1` (`BEST_SPEED`) — fastest compression; minimal CPU effort.
+ - `2`-`8` — intermediate levels, trading compression speed for ratio.
+ - `9` (`BEST_COMPRESSION`) — slowest compression; maximum compression
+ ratio.
+ - `-1` (`DEFAULT_COMPRESSION`) — a sentinel value, not a compression
+ level itself. The underlying zlib library interprets it as "use the
+ default level," which zlib resolves internally to level 6.
+
+ Most users should leave this at -1 (or omit setting the variable entirely)
+ to get the default level, or choose a value between 0 and 9 that fits
+ their speed-vs-compression-ratio needs.
+
+ Some reports indicate that level 9 is approximately 10x slower than the
Review Comment:
I don't think we need this paragraph about level 9 speed vs determinism. I
don't think that level guarantees determinism so this might not even be
correct, and it'll just make things really really slow. In our small limited
tests we seem to get deterministic behavior, but it wouldn't surprise me if
larger more complex data doesn't always compress the same.
As with above, I would suggest we just stick with describing the levels and
let users do their own research about what level to use for their use-case.
##########
site/layers.md:
##########
@@ -608,15 +608,40 @@ The entire fixed length region of the data will be pulled
into a byte buffer in
- Name: gzip
- Namespace URI: urn:org.apache.daffodil.layers.gzip
-- Parameter Variables: None
+- Parameter Variables
+ - `compressionLevel` an integer specifying the gzip compression level used
+ when unparsing. Valid values are:
+ - `0` (`NO_COMPRESSION`) — no compression; data is stored as-is in
+ gzip-formatted blocks.
+ - `1` (`BEST_SPEED`) — fastest compression; minimal CPU effort.
+ - `2`-`8` — intermediate levels, trading compression speed for ratio.
+ - `9` (`BEST_COMPRESSION`) — slowest compression; maximum compression
+ ratio.
+ - `-1` (`DEFAULT_COMPRESSION`) — a sentinel value, not a compression
+ level itself. The underlying zlib library interprets it as "use the
+ default level," which zlib resolves internally to level 6.
Review Comment:
Suggest we simplify this to something less verbose. I think we can also lump
1-9 in a single bullet. We should also leave off the all caps values since
those are really internal java identifiers that users won't care about. I'd
also avoid mentioning zlib since different JVMs coudl use different libraries
and its an implementation detail that I don't think users really need. We
mention GZIPOutputStream below, so if users do need to know impelmentation
details about what library is used they can start there as a reference point.
Maybe Something like:
Valid values are:
`0` - no compression
`1` to `9` - amount of compression, ranging from 1 (fastest with least
compression) to 9 (slowest with best compression)
`-1` - use implementation default, usually level 6. This is the default
parameter value if the variable is not set.
##########
site/layers.md:
##########
@@ -608,15 +608,40 @@ The entire fixed length region of the data will be pulled
into a byte buffer in
- Name: gzip
- Namespace URI: urn:org.apache.daffodil.layers.gzip
-- Parameter Variables: None
+- Parameter Variables
+ - `compressionLevel` an integer specifying the gzip compression level used
+ when unparsing. Valid values are:
+ - `0` (`NO_COMPRESSION`) — no compression; data is stored as-is in
+ gzip-formatted blocks.
+ - `1` (`BEST_SPEED`) — fastest compression; minimal CPU effort.
+ - `2`-`8` — intermediate levels, trading compression speed for ratio.
+ - `9` (`BEST_COMPRESSION`) — slowest compression; maximum compression
+ ratio.
+ - `-1` (`DEFAULT_COMPRESSION`) — a sentinel value, not a compression
+ level itself. The underlying zlib library interprets it as "use the
+ default level," which zlib resolves internally to level 6.
+
+ Most users should leave this at -1 (or omit setting the variable entirely)
+ to get the default level, or choose a value between 0 and 9 that fits
+ their speed-vs-compression-ratio needs.
+
+ Some reports indicate that level 9 is approximately 10x slower than the
+ default for only about 16% additional compression, so it is rarely the
+ right choice in production. It is primarily useful when byte-for-byte
+ deterministic output is required across JVMs linked against different zlib
+ implementations (e.g., stock zlib vs. zlib-ng), since these implementations
+ may produce byte-different output at lower levels due to differing
match-finding
+ heuristics, but tend to converge on identical output at level 9.
- Result Variables: None
- Import Statement:
```
<xs:import namespace="urn:org.apache.daffodil.layers.gzip"
schemaLocation="/org/apache/daffodil/layers/xsd/gzipLayer.dfdl.xsd"/>
```
-This layer uses the `java.util.zip.GZIPInputStream` and
`java.util.zip.GZIPOutputStream`
-libraries to decode and encode.
+This layer uses `java.util.zip.GZIPInputStream` to decode and
+`ConfigurableGZIPOutputStream` (a thin wrapper around
+`java.util.zip.GZIPOutputStream`) to encode with a configurable
+compression level.
Review Comment:
I don't think we need to mention ConfigurableGZIPOutputStream. While its
true that's what we use, that's just an implementation detail needed to be able
to change the level. I don't think any users would really care about that as it
doesn't really affect the compression. At core this line is really just trying
to describe what gzip implementation Daffodil uses so people can understand how
it compression might behave.
##########
site/layers.md:
##########
@@ -608,15 +608,40 @@ The entire fixed length region of the data will be pulled
into a byte buffer in
- Name: gzip
- Namespace URI: urn:org.apache.daffodil.layers.gzip
-- Parameter Variables: None
+- Parameter Variables
+ - `compressionLevel` an integer specifying the gzip compression level used
+ when unparsing. Valid values are:
+ - `0` (`NO_COMPRESSION`) — no compression; data is stored as-is in
+ gzip-formatted blocks.
+ - `1` (`BEST_SPEED`) — fastest compression; minimal CPU effort.
+ - `2`-`8` — intermediate levels, trading compression speed for ratio.
+ - `9` (`BEST_COMPRESSION`) — slowest compression; maximum compression
+ ratio.
+ - `-1` (`DEFAULT_COMPRESSION`) — a sentinel value, not a compression
+ level itself. The underlying zlib library interprets it as "use the
+ default level," which zlib resolves internally to level 6.
+
+ Most users should leave this at -1 (or omit setting the variable entirely)
+ to get the default level, or choose a value between 0 and 9 that fits
+ their speed-vs-compression-ratio needs.
Review Comment:
I don't think we should include this paragraph. I'm not really sure what we
should recommend to users. I think -1 is probably reasonable, but I don't
really know. I'd suggest we just leave the description of the possible values
above and let users decide from that, without necessarily making a
recommendation. They can look up performance numbers and level comparisons if
they aren't sure what to pick.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]