stevedlawrence commented on code in PR #227:
URL: https://github.com/apache/daffodil-site/pull/227#discussion_r3323927284


##########
site/layers.md:
##########
@@ -608,15 +608,40 @@ The entire fixed length region of the data will be pulled 
into a byte buffer in
 
 - Name: gzip
 - Namespace URI: urn:org.apache.daffodil.layers.gzip
-- Parameter Variables: None
+- Parameter Variables
+  - `compressionLevel` an integer specifying the gzip compression level used
+    when unparsing. Valid values are:
+    - `0` (`NO_COMPRESSION`) — no compression; data is stored as-is in
+      gzip-formatted blocks.
+    - `1` (`BEST_SPEED`) — fastest compression; minimal CPU effort.
+    - `2`-`8` — intermediate levels, trading compression speed for ratio.
+    - `9` (`BEST_COMPRESSION`) — slowest compression; maximum compression
+      ratio.
+    - `-1` (`DEFAULT_COMPRESSION`) — a sentinel value, not a compression
+      level itself. The underlying zlib library interprets it as "use the
+      default level," which zlib resolves internally to level 6.
+
+    Most users should leave this at -1 (or omit setting the variable entirely)
+    to get the default level, or choose a value between 0 and 9 that fits
+    their speed-vs-compression-ratio needs.
+
+    Some reports indicate that level 9 is approximately 10x slower than the

Review Comment:
   I don't think we need this paragraph about level 9 speed vs determinism. I 
don't think that level guarantees determinism so this might not even be 
correct, and it'll just make things really really slow. In our small limited 
tests we seem to get  deterministic behavior, but it wouldn't surprise me if 
larger more complex data doesn't always compress the same.
   
   As with above, I would suggest we just stick with describing the levels and 
let users do their own research about what level to use for their use-case.



##########
site/layers.md:
##########
@@ -608,15 +608,40 @@ The entire fixed length region of the data will be pulled 
into a byte buffer in
 
 - Name: gzip
 - Namespace URI: urn:org.apache.daffodil.layers.gzip
-- Parameter Variables: None
+- Parameter Variables
+  - `compressionLevel` an integer specifying the gzip compression level used
+    when unparsing. Valid values are:
+    - `0` (`NO_COMPRESSION`) — no compression; data is stored as-is in
+      gzip-formatted blocks.
+    - `1` (`BEST_SPEED`) — fastest compression; minimal CPU effort.
+    - `2`-`8` — intermediate levels, trading compression speed for ratio.
+    - `9` (`BEST_COMPRESSION`) — slowest compression; maximum compression
+      ratio.
+    - `-1` (`DEFAULT_COMPRESSION`) — a sentinel value, not a compression
+      level itself. The underlying zlib library interprets it as "use the
+      default level," which zlib resolves internally to level 6.

Review Comment:
   Suggest we simplify this to something less verbose. I think we can also lump 
1-9 in a single bullet. We should also leave off the all caps values since 
those are really internal java identifiers that users won't care about. I'd 
also avoid mentioning zlib since different JVMs coudl use different libraries 
and its an implementation detail that I don't think users really need. We 
mention GZIPOutputStream below, so if users do need to know impelmentation 
details about what library is used they can start there as a reference point.
   
   Maybe Something like:
   
   Valid values are:
   
   `0` - no compression
   `1` to `9` - amount of compression, ranging from 1 (fastest with least 
compression) to 9 (slowest with best compression)
   `-1` - use implementation default, usually level 6. This is the default 
parameter value if the variable is not set.



##########
site/layers.md:
##########
@@ -608,15 +608,40 @@ The entire fixed length region of the data will be pulled 
into a byte buffer in
 
 - Name: gzip
 - Namespace URI: urn:org.apache.daffodil.layers.gzip
-- Parameter Variables: None
+- Parameter Variables
+  - `compressionLevel` an integer specifying the gzip compression level used
+    when unparsing. Valid values are:
+    - `0` (`NO_COMPRESSION`) — no compression; data is stored as-is in
+      gzip-formatted blocks.
+    - `1` (`BEST_SPEED`) — fastest compression; minimal CPU effort.
+    - `2`-`8` — intermediate levels, trading compression speed for ratio.
+    - `9` (`BEST_COMPRESSION`) — slowest compression; maximum compression
+      ratio.
+    - `-1` (`DEFAULT_COMPRESSION`) — a sentinel value, not a compression
+      level itself. The underlying zlib library interprets it as "use the
+      default level," which zlib resolves internally to level 6.
+
+    Most users should leave this at -1 (or omit setting the variable entirely)
+    to get the default level, or choose a value between 0 and 9 that fits
+    their speed-vs-compression-ratio needs.
+
+    Some reports indicate that level 9 is approximately 10x slower than the
+    default for only about 16% additional compression, so it is rarely the
+    right choice in production. It is primarily useful when byte-for-byte
+    deterministic output is required across JVMs linked against different zlib
+    implementations (e.g., stock zlib vs. zlib-ng), since these implementations
+    may produce byte-different output at lower levels due to differing 
match-finding
+    heuristics, but tend to converge on identical output at level 9.
 - Result Variables: None
 - Import Statement:
 ```
       <xs:import namespace="urn:org.apache.daffodil.layers.gzip"
          schemaLocation="/org/apache/daffodil/layers/xsd/gzipLayer.dfdl.xsd"/>
 ```
-This layer uses the `java.util.zip.GZIPInputStream` and 
`java.util.zip.GZIPOutputStream` 
-libraries to decode and encode.
+This layer uses `java.util.zip.GZIPInputStream` to decode and
+`ConfigurableGZIPOutputStream` (a thin wrapper around
+`java.util.zip.GZIPOutputStream`) to encode with a configurable
+compression level.

Review Comment:
   I don't think we need to mention ConfigurableGZIPOutputStream. While its 
true that's what we use, that's just an implementation detail needed to be able 
to change the level. I don't think any users would really care about that as it 
doesn't really affect the compression. At core this line is really just trying 
to describe what gzip implementation Daffodil uses so people can understand how 
it compression might behave.



##########
site/layers.md:
##########
@@ -608,15 +608,40 @@ The entire fixed length region of the data will be pulled 
into a byte buffer in
 
 - Name: gzip
 - Namespace URI: urn:org.apache.daffodil.layers.gzip
-- Parameter Variables: None
+- Parameter Variables
+  - `compressionLevel` an integer specifying the gzip compression level used
+    when unparsing. Valid values are:
+    - `0` (`NO_COMPRESSION`) — no compression; data is stored as-is in
+      gzip-formatted blocks.
+    - `1` (`BEST_SPEED`) — fastest compression; minimal CPU effort.
+    - `2`-`8` — intermediate levels, trading compression speed for ratio.
+    - `9` (`BEST_COMPRESSION`) — slowest compression; maximum compression
+      ratio.
+    - `-1` (`DEFAULT_COMPRESSION`) — a sentinel value, not a compression
+      level itself. The underlying zlib library interprets it as "use the
+      default level," which zlib resolves internally to level 6.
+
+    Most users should leave this at -1 (or omit setting the variable entirely)
+    to get the default level, or choose a value between 0 and 9 that fits
+    their speed-vs-compression-ratio needs.

Review Comment:
   I don't think we should include this paragraph. I'm not really sure what we 
should recommend to users. I think -1 is probably reasonable, but I don't 
really know. I'd suggest we just leave the description of the possible values 
above and let users decide from that, without necessarily making a 
recommendation. They can look up performance numbers and level comparisons if 
they aren't sure what to pick.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to