This is an automated email from the ASF dual-hosted git repository.

stevedlawrence pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/daffodil.git


The following commit(s) were added to refs/heads/main by this push:
     new f66cafefc Add configurable gzip compression level
f66cafefc is described below

commit f66cafefc8471a93425ad61df999a40b78e3ca43
Author: Guichard Desrosiers <[email protected]>
AuthorDate: Tue May 26 19:45:54 2026 -0400

    Add configurable gzip compression level
    
    The gzip layer previously used GZIPOutputStream with the JDK's default
    compression level (level 6), with no way for schemas to choose a different
    level. This change exposes the speed-vs-size trade-off that the gzip format
    already supports, giving users control over how their data is compressed
    during unparsing.
    
    As a side benefit, this fix also addresses cross-zlib test failures
    observed on Fedora 42. Fedora's OpenJDK links against zlib-ng while
    Temurin and most other distributions bundle stock zlib. Both produce valid
    gzip output, but the deflate token streams differ due to different
    match-finding heuristics, causing tests with hardcoded byte baselines to
    fail on Fedora. For the small test data used in Daffodil's gzip tests,
    level 9 happens to produce identical output on both implementations,
    letting the tests pass regardless of which zlib is linked. This
    convergence at level 9 is empirical, not guaranteed.
    
    Changes:
    
    - Add `compressionLevel` DFDL variable to the gzip layer schema, defaulting
      to Deflater.DEFAULT_COMPRESSION (-1). Schemas can override via
      `dfdl:newVariableInstance` or `dfdl:setVariable`, and users can set the
      value externally without having to modify the schema.
    
    - Add ConfigurableGZIPOutputStream, a GZIPOutputStream subclass that allows
      the compression level to be set via constructor argument.
    
    - Update GZipLayer to accept the compressionLevel variable via
      setLayerVariableParameters and use it when constructing the output stream.
    
    - Remove GZIPFixedOutputStream and the associated fixIsNeeded() method.
      These existed to work around a pre-Java-16 bug where GZIPOutputStream
      wrote 0x00 instead of the spec-compliant 0xFF for the gzip OS header
      byte. Since Daffodil's minimum supported Java version is now 17, this
      workaround is no longer needed.
    
    - Add `-parameters` to javac options This preserves Java parameter names in
      bytecode, which is required by Daffodil's reflection-based layer parameter
      resolution. Without this, Java setters appear with parameters named 
arg0/arg1/...
      and cannot be matched to schema variables.
    
    - Update test schemas (exampleGzipLayer.dfdl.xsd, 
exampleGzipLayer2.dfdl.xsd)
      to set compressionLevel=9 via newVariableInstance.
    
    - Update TestGzipErrors.makeGZIPData to generate test input data at level 9
      for byte-stable output across JVMs, with comments documenting the
      empirical nature of the convergence.
    
    DAFFODIL-3082
---
 build.sbt                                          |   4 +
 .../runtime1/ConfigurableGZIPOutputStream.java     |  59 +++++++++
 .../apache/daffodil/layers/runtime1/GZipLayer.java | 136 +++++----------------
 .../apache/daffodil/layers/xsd/gzipLayer.dfdl.xsd  |   4 +-
 .../daffodil/core/layers/TestGzipErrors.scala      |  15 ++-
 .../apache/daffodil/core/layers/TestLayers.scala   |   6 +-
 .../daffodil/layers/runtime1/TestGzipLayer.scala   |  31 +----
 .../daffodil/layers/exampleGzipLayer.dfdl.xsd      |  11 ++
 .../daffodil/layers/exampleGzipLayer2.dfdl.xsd     |  11 ++
 9 files changed, 126 insertions(+), 151 deletions(-)

diff --git a/build.sbt b/build.sbt
index 38d910eac..b9f980174 100644
--- a/build.sbt
+++ b/build.sbt
@@ -247,6 +247,10 @@ def buildJavacOptions() = {
     "-deprecation",
     "-Xlint:dep-ann",
     "-Xlint:unchecked",
+    // Preserve Java parameter names in bytecode so Daffodil's reflection-based
+    // layer parameter resolution can match method parameters to DFDL variables
+    // by name (otherwise they appear as arg0/arg1/...).
+    "-parameters",
     "--release",
     minSupportedJavaVersion
   )
diff --git 
a/daffodil-core/src/main/java/org/apache/daffodil/layers/runtime1/ConfigurableGZIPOutputStream.java
 
b/daffodil-core/src/main/java/org/apache/daffodil/layers/runtime1/ConfigurableGZIPOutputStream.java
new file mode 100644
index 000000000..6b042fb64
--- /dev/null
+++ 
b/daffodil-core/src/main/java/org/apache/daffodil/layers/runtime1/ConfigurableGZIPOutputStream.java
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.daffodil.layers.runtime1;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.zip.GZIPOutputStream;
+
+/**
+ * A {@link GZIPOutputStream} subclass that allows the caller to specify the
+ * compression level explicitly. The standard {@code GZIPOutputStream} uses
+ * the JDK's default level (which zlib resolves to level 6), with no
+ * mechanism to choose a different level. This subclass gives callers control
+ * over the speed-vs-compression-ratio trade-off that the gzip format supports.
+ *
+ * <p>According to the zlib documentation and the logic in {@code deflate.c},
+ * compression levels range from 0 to 9:
+ * <ul>
+ *   <li>Level 0: no compression (fastest)</li>
+ *   <li>Level 1: minimal compression, fastest of the compressing levels</li>
+ *   <li>Level 9: best compression, slowest</li>
+ * </ul>
+ *
+ * <p>As the compression level increases, encoding speed decreases. Some
+ * reports indicate that level 9 is approximately 10x slower than the default
+ * for only about 16% additional compression.
+ */
+public final class ConfigurableGZIPOutputStream extends GZIPOutputStream {
+
+    /**
+     * Creates a {@code ConfigurableGZIPOutputStream} that writes compressed
+     * data to the given output stream at the specified compression level.
+     *
+     * @param out   the output stream to write compressed data to
+     * @param level the compression level; an integer in the range 0-9,
+     *              or {@link java.util.zip.Deflater#DEFAULT_COMPRESSION}
+     *              ({@code -1}) for the default
+     * @throws IOException              if an I/O error occurs
+     * @throws IllegalArgumentException if {@code level} is not a valid 
compression level
+     */
+    public ConfigurableGZIPOutputStream(OutputStream out, int level) throws 
IOException {
+        super(out);
+        this.def.setLevel(level);
+    }
+}
\ No newline at end of file
diff --git 
a/daffodil-core/src/main/java/org/apache/daffodil/layers/runtime1/GZipLayer.java
 
b/daffodil-core/src/main/java/org/apache/daffodil/layers/runtime1/GZipLayer.java
index df37f95ff..bfab2b337 100644
--- 
a/daffodil-core/src/main/java/org/apache/daffodil/layers/runtime1/GZipLayer.java
+++ 
b/daffodil-core/src/main/java/org/apache/daffodil/layers/runtime1/GZipLayer.java
@@ -22,130 +22,50 @@ import org.apache.daffodil.api.layers.Layer;
 import java.io.IOException;
 import java.io.InputStream;
 import java.io.OutputStream;
-import java.util.Objects;
 import java.util.zip.GZIPInputStream;
-import java.util.zip.GZIPOutputStream;
+import java.util.zip.Deflater;
 
 public final class GZipLayer extends Layer {
-
-  private static Boolean fixNeeded = null;
-
-  public static boolean fixIsNeeded() {
-    if (Objects.isNull(fixNeeded)) {
-      // prior to java 16
-      String versionString = System.getProperty("java.version");
-
-      // Extract the major version using string manipulation
-      int majorVersion;
-      if (versionString.startsWith("1.8")) {
-        majorVersion = 8;
-      } else {
-        String[] parts = versionString.split("\\.");
-        assert (parts.length > 0);
-        majorVersion = Integer.parseInt(parts[0]);
-      }
-      fixNeeded = (majorVersion < 16);
-    }
-    return fixNeeded;
-  }
-
+  
+  private int compressionLevel = Deflater.DEFAULT_COMPRESSION;
+  
   public GZipLayer() {
     super("gzip", "urn:org.apache.daffodil.layers.gzip");
     // convert IOExceptions to Processing Errors
     setProcessingErrorException(IOException.class);
   }
 
-  @Override
-  public InputStream wrapLayerInput(InputStream jis) throws Exception {
-    return new GZIPInputStream(jis);
-  }
-
-  @Override
-  public OutputStream wrapLayerOutput(OutputStream jos) throws Exception {
-    OutputStream fixedOS = fixIsNeeded() ? new GZIPFixedOutputStream(jos) : 
jos;
-    return new GZIPOutputStream(fixedOS);
-  }
-
-}
-
-/**
- * Prior to Java 16, the java.util.zip.GZIPOutputStream wrote a value of zero 
for
- * the OS field in the header (byte index 9).
- * In Java 16, this was changed to a
- * value of 255 to better abide by the GZIP specification. Unfortunately, this
- * means unparsed data using a GZIP layer might have a single byte difference,
- * depending on the Java version used. This can lead to inconsistent behavior 
of
- * test failures that expect a certain byte value.
- * <p>
- * To resolve this issue, we create this GZIPFixedOutputStream. This should 
wrap
- * the underlying OutputStream and be passed as the OutputStream to the
- * GZIPOutputStream. When the GZIPOutputStream writes the 9th byte to this
- * GZIPFixedOutputStream, this will always write a value of 255, making all 
Java
- * versions prior to 16 consistent with Java 16+ behavior.
- */
-class GZIPFixedOutputStream extends OutputStream {
-
-  private final OutputStream os;
-
-  public GZIPFixedOutputStream(OutputStream os) {
-    this.os = os;
-  }
-
   /**
-   * The next byte position that byte will be written to. If this is negative,
-   * that means we have already fixed the output and everything should just
-   * pass straight through.
+   * Provides the layer's parameter variables to the layer.
+   *
+   * <p>The compression level controls the trade-off between encoding speed
+   * and output size when gzip-compressing data during unparsing. Higher
+   * levels produce smaller output but take longer to encode.
+   *
+   * @param compressionLevel an integer specifying the gzip compression level
+   *                         to use when unparsing. Valid values are 0 through
+   *                         9, where 0 means no compression and 9 means 
maximum
+   *                         compression, or {@link 
Deflater#DEFAULT_COMPRESSION}
+   *                         ({@code -1}) to use the underlying zlib library's
+   *                         default (level 6).
    */
-  private int bytePosition = 0;
-
-  @Override
-  public void close() throws IOException {
-    os.close();
+  public void setLayerVariableParameters(int compressionLevel) {
+    boolean isValidRange = (compressionLevel >= 0 && compressionLevel <= 9)
+            || compressionLevel == Deflater.DEFAULT_COMPRESSION;
+    if (!isValidRange) 
+      runtimeSchemaDefinitionError("Invalid compression level: " + 
compressionLevel);
+    
+    this.compressionLevel = compressionLevel;
   }
-
+  
   @Override
-  public void flush() throws IOException {
-    os.flush();
+  public InputStream wrapLayerInput(InputStream jis) throws Exception {
+    return new GZIPInputStream(jis);
   }
 
   @Override
-  public void write(byte[] b, int off, int len) throws IOException {
-    if (bytePosition < 0) {
-      // The bad byte has been fixed, pass all writes directly through to the
-      // underlying OutputStream. This may be more efficient than the default
-      // OutputStream write() function, which writes the bytes from this array
-      // one at a time
-      os.write(b, off, len);
-    } else {
-      // The bad byte has not been fixed yet. Unless a newer version of Java
-      // has made changes, the GZIPOutputStreamm will have passed in a 10 byte
-      // array to this function that includes the bad byte. Let's just write
-      // that array using the default write(array) method that writes these
-      // bytes one at a time and will call the write(int) method that will fix
-      // that byte. Calling write() one at a time is maybe inefficient but for
-      // such a small array it should not have a noticeable effect.
-      super.write(b, off, len);
-    }
+  public OutputStream wrapLayerOutput(OutputStream jos) throws Exception {
+    return new ConfigurableGZIPOutputStream(jos, compressionLevel);
   }
 
-  @Override
-  public void write(int b) throws IOException {
-    if (bytePosition < 0) {
-      // The bad byte has already been fixed, simply pass this byte through to
-      // the underlying OutputStream
-      os.write(b);
-    } else if (bytePosition < 9) {
-      // The bad byte has not been fixed, and we haven't reached it yet, simply
-      // pass this byte through and increment our byte position
-      os.write(b);
-      bytePosition += 1;
-    } else if (bytePosition == 9) {
-      // This is the bad byte, it is a 0 on some Java versions. Write 255
-      // instead of to match Java 16+ behavior. Also, set bytePosition to -1 to
-      // signify that we have fixed the bad byte and that all other writes
-      // should just pass directly to the underlying OutputStream
-      os.write(255);
-      bytePosition = -1;
-    }
-  }
 }
diff --git 
a/daffodil-core/src/main/resources/org/apache/daffodil/layers/xsd/gzipLayer.dfdl.xsd
 
b/daffodil-core/src/main/resources/org/apache/daffodil/layers/xsd/gzipLayer.dfdl.xsd
index 30aca2ae0..d8a8a802f 100644
--- 
a/daffodil-core/src/main/resources/org/apache/daffodil/layers/xsd/gzipLayer.dfdl.xsd
+++ 
b/daffodil-core/src/main/resources/org/apache/daffodil/layers/xsd/gzipLayer.dfdl.xsd
@@ -26,8 +26,8 @@
   <annotation>
     <appinfo source="http://www.ogf.org/dfdl/";>
 
-      <!-- the gzip layer has no parameter variables
-      nor return result variables. -->
+      <!-- gzip layer variable-->
+      <dfdl:defineVariable name="compressionLevel" type="xs:int" 
defaultValue="-1" external="true"/>
 
     </appinfo>
   </annotation>
diff --git 
a/daffodil-core/src/test/scala/org/apache/daffodil/core/layers/TestGzipErrors.scala
 
b/daffodil-core/src/test/scala/org/apache/daffodil/core/layers/TestGzipErrors.scala
index 474d2b964..508f975a0 100644
--- 
a/daffodil-core/src/test/scala/org/apache/daffodil/core/layers/TestGzipErrors.scala
+++ 
b/daffodil-core/src/test/scala/org/apache/daffodil/core/layers/TestGzipErrors.scala
@@ -19,6 +19,7 @@ package org.apache.daffodil.core.layers
 
 import java.io.ByteArrayOutputStream
 import java.nio.charset.StandardCharsets
+import java.util.zip.Deflater
 import scala.xml.Elem
 
 import org.apache.daffodil.core.util.FuzzOneBits
@@ -27,6 +28,7 @@ import org.apache.daffodil.core.util.FuzzRandomSingleByte
 import org.apache.daffodil.core.util.FuzzZeroBits
 import org.apache.daffodil.core.util.LayerParseTester
 import org.apache.daffodil.core.util.TestUtils
+import org.apache.daffodil.layers.runtime1.ConfigurableGZIPOutputStream
 import org.apache.daffodil.lib.util.*
 import org.apache.daffodil.lib.xml.XMLUtils
 
@@ -85,14 +87,15 @@ a few lines of pointless text like this.""".replace("\r\n", 
"\n").replace("\n",
 
   def makeGZIPData(text: String) = {
     val baos = new ByteArrayOutputStream()
-    val gzos = new java.util.zip.GZIPOutputStream(baos)
+    // Level 9 happens to produce identical deflate output on both stock zlib 
and
+    // zlib-ng for this particular test data, letting the hardcoded baselines 
below
+    // match regardless of which implementation the JVM uses. This is 
empirical and
+    // could break with different inputs or future zlib versions.
+    val gzos = new ConfigurableGZIPOutputStream(baos, 
Deflater.BEST_COMPRESSION)
+
     IOUtils.write(text, gzos, StandardCharsets.UTF_8)
     gzos.close()
-    val data = baos.toByteArray()
-    // Java 16+ sets the 9th byte to 0xFF, but previous Java versions set the
-    // value to 0x00. Daffodil always unparses with 0xFF regardless of Java
-    // version, so force the gzip data to 0xFF to make sure tests round trip
-    data(9) = 0xff.toByte
+    val data = baos.toByteArray
     data
   }
 
diff --git 
a/daffodil-core/src/test/scala/org/apache/daffodil/core/layers/TestLayers.scala 
b/daffodil-core/src/test/scala/org/apache/daffodil/core/layers/TestLayers.scala
index fe4375130..832601b62 100644
--- 
a/daffodil-core/src/test/scala/org/apache/daffodil/core/layers/TestLayers.scala
+++ 
b/daffodil-core/src/test/scala/org/apache/daffodil/core/layers/TestLayers.scala
@@ -177,11 +177,7 @@ class TestLayers {
     val gzos = new java.util.zip.GZIPOutputStream(baos)
     IOUtils.write(text, gzos, StandardCharsets.UTF_8)
     gzos.close()
-    val data = baos.toByteArray()
-    // Java 16+ sets the 9th byte to 0xFF, but previous Java versions set the
-    // value to 0x00. Daffodil always unparses with 0xFF regardless of Java
-    // version, so force the gzip data to 0xFF to make sure tests round trip
-    data(9) = 0xff.toByte
+    val data = baos.toByteArray
     data
   }
 
diff --git 
a/daffodil-core/src/test/scala/org/apache/daffodil/layers/runtime1/TestGzipLayer.scala
 
b/daffodil-core/src/test/scala/org/apache/daffodil/layers/runtime1/TestGzipLayer.scala
index 4d770bc69..1426a16c6 100644
--- 
a/daffodil-core/src/test/scala/org/apache/daffodil/layers/runtime1/TestGzipLayer.scala
+++ 
b/daffodil-core/src/test/scala/org/apache/daffodil/layers/runtime1/TestGzipLayer.scala
@@ -17,10 +17,6 @@
 
 package org.apache.daffodil.layers.runtime1
 
-import java.io.ByteArrayOutputStream
-import java.util.zip.GZIPOutputStream
-
-import org.apache.commons.io.IOUtils
 import org.junit.Assert.*
 import org.junit.Test
 
@@ -28,32 +24,7 @@ class TestGzipLayer {
 
   val gl = new GZipLayer // default constructor has to work.
 
-  @Test def testConstruction() = {
+  @Test def testConstruction(): Unit = {
     assertNotNull(gl)
   }
-
-  /**
-   * Tests that our gzip fixer fixes up byte 9 of the gzip header
-   * so that it has 255 in it for java versions prior to Java 16
-   */
-  @Test def testByte9() = {
-    val baos = new ByteArrayOutputStream()
-    val baosRaw = new ByteArrayOutputStream()
-    val gzos = gl.wrapLayerOutput(baos)
-    val gzosRaw = new GZIPOutputStream(baosRaw)
-    val data = "testing 1, 2, 3"
-    IOUtils.write(data, gzos, "ascii")
-    IOUtils.write(data, gzosRaw, "ascii")
-    gzos.close()
-    gzosRaw.close()
-    val byte9 = baos.toByteArray.apply(9)
-    val byte9Raw = baosRaw.toByteArray.apply(9)
-    if (GZipLayer.fixIsNeeded()) {
-      assertEquals(0, byte9Raw.toInt & 0xff)
-      assertEquals(255, byte9.toInt & 0xff)
-    } else {
-      assertEquals(255, byte9Raw.toInt & 0xff)
-      assertEquals(255, byte9.toInt & 0xff)
-    }
-  }
 }
diff --git 
a/daffodil-test/src/test/resources/org/apache/daffodil/layers/exampleGzipLayer.dfdl.xsd
 
b/daffodil-test/src/test/resources/org/apache/daffodil/layers/exampleGzipLayer.dfdl.xsd
index be49050ec..455d93dce 100644
--- 
a/daffodil-test/src/test/resources/org/apache/daffodil/layers/exampleGzipLayer.dfdl.xsd
+++ 
b/daffodil-test/src/test/resources/org/apache/daffodil/layers/exampleGzipLayer.dfdl.xsd
@@ -59,6 +59,17 @@
               <element name="contents">
                 <complexType>
                   <sequence dfdlx:layer="gz:gzip">
+                    <annotation>
+                      <documentation>
+                        Level 9 (slowest) is used here only to improve the 
likelihood of
+                        deterministic gzip output across different JVMs and 
zlib implementations
+                        during testing. Real-world users should either omit 
this newVariableInstance
+                        to use the default level (6) or choose a level 
appropriate for their needs.
+                      </documentation>
+                      <appinfo source="http://www.ogf.org/dfdl/";>
+                        <dfdl:newVariableInstance ref="gz:compressionLevel" 
defaultValue="9"/>
+                      </appinfo>
+                    </annotation>
                     <group ref="ex:compressedGroupContents"/>
                   </sequence>
                 </complexType>
diff --git 
a/daffodil-test/src/test/resources/org/apache/daffodil/layers/exampleGzipLayer2.dfdl.xsd
 
b/daffodil-test/src/test/resources/org/apache/daffodil/layers/exampleGzipLayer2.dfdl.xsd
index 976492ca8..62d48db9c 100644
--- 
a/daffodil-test/src/test/resources/org/apache/daffodil/layers/exampleGzipLayer2.dfdl.xsd
+++ 
b/daffodil-test/src/test/resources/org/apache/daffodil/layers/exampleGzipLayer2.dfdl.xsd
@@ -93,6 +93,17 @@
                          now the gzipped layered sequence itself
                        -->
                       <sequence dfdlx:layer="gz:gzip">
+                        <annotation>
+                          <documentation>
+                            Level 9 (slowest) is used here only to improve the 
likelihood of
+                            deterministic gzip output across different JVMs 
and zlib implementations
+                            during testing. Real-world users should either 
omit this newVariableInstance
+                            to use the default level (6) or choose a level 
appropriate for their needs.
+                          </documentation>
+                          <appinfo source="http://www.ogf.org/dfdl/";>
+                            <dfdl:newVariableInstance 
ref="gz:compressionLevel" defaultValue="9"/>
+                          </appinfo>
+                        </annotation>
                         <!--
                           finally, inside that, we have the original 
fileTypeGroup group reference.
                           -->

Reply via email to