stevedlawrence commented on code in PR #1672:
URL: https://github.com/apache/daffodil/pull/1672#discussion_r3310552414


##########
daffodil-core/src/main/java/org/apache/daffodil/layers/runtime1/DeterministicGZIPOutputStream.java:
##########
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.daffodil.layers.runtime1;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.zip.Deflater;
+import java.util.zip.GZIPOutputStream;
+
+/**
+ * A GZIPOutputStream that uses Deflater BEST_COMPRESSION to produce 
+ * output that is consistent across JDKs linked against different zlib
+ * implementations (e.g. stock zlib vs. zlib-ng).
+ *
+ * According to the zlib documentation/logic from deflate.c, 
+ * compression levels range from 0 to 9:
+ *   Level 0: fastest, no compression
+ *   Level 1: some compression but faster than all other compression levels
+ *   Level 9: best compression but slowest
+ * 
+ * As the compression level number increases, speed decreases. BEST_COMPRESSION
+ * corresponds to level 9, which is at the slowest end of the spectrum.
+ *
+ * At DEFAULT_COMPRESSION(level 6) compression level, stock zlib and zlib-ng 
produce byte-different
+ * gzip output for the same input, but with BEST_COMPRESSION(level 9) they 
produce identical output.
+ */
+public final class DeterministicGZIPOutputStream extends GZIPOutputStream {
+    public DeterministicGZIPOutputStream(OutputStream out) throws IOException {
+        super(out);
+        // Force maximum compression so that stock zlib and zlib-ng converge
+        // on the same output bytes.
+        this.def.setLevel(Deflater.BEST_COMPRESSION);

Review Comment:
   I'm not sure we want to default the gzip layer to "best" compression. Even 
if it's more deterministic, it's significantly slower for files that aren't 
that much smaller. Here's a reference that shows level 9 is 10x slower for only 
16% more compression:
   
   https://devtoys.pro/sk/blog/gzip-compression-levels
   
   I think compression level 9 really only wants to be used when creating a 
file that's going to be downloaded lots of times. That way that large extra 
compression time is worth the relatively small size savings. But I don't think 
that's generally what DFDL users care about, they want things done fast even if 
it doesn't mean maximum compression.
   
   Maybe an alternatively approach is to allow passing a level to the 
constructor of this class, so instead of this being a deterministic gzip output 
stream, it's more of a configurable gzip outputs stream?
   
   Then we can use the layer parameter feature (i.e. add a 
`dfdl:defineVariable` to gzipLayer.dfdl.xsd and implement the 
`setLayerVariableParameters` function in GZIPLayer.java) so that a DFDL 
variable can be used to configure the level. The defineVariable can default to 
-1 (i.e. use implementation default), but our tests could look something like 
this to use level 9 for more deterministic compression:
   
   ```xml
   <sequence dfdlx:layer="gz:gzip">
     <annotation>
       <appinfo source="http://www.ogf.org/dfdl/";>
         <dfdl:newVariableInstance ref="gz:level" defaultValue="9"/>
       </appinfo>
     </annotation>
     ...
   </sequence>
   ```
   
   And maybe we even want to consider using compression level 0 for our tests? 
I imagine that gives us the best chance of deterministic output. It won't test 
compression at all, but I don't think that's actually all that important--we 
can pretty safely assume the GZIPOutputStream compresses correctly. Our gzip 
layer tests really just need to test that layering works correctly and creates 
data that follows the gzip format, which even compression level 0 should do.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to