This is an automated email from the ASF dual-hosted git repository.

ggregory pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/commons-compress.git


The following commit(s) were added to refs/heads/master by this push:
     new 97919fc60 feat: enforce 20-bit Huffman code-length limit consistently 
(#699)
97919fc60 is described below

commit 97919fc6010556e751e7a3aa3584f4cde058cf19
Author: Piotr P. Karwasz <[email protected]>
AuthorDate: Wed Aug 27 23:11:58 2025 +0200

    feat: enforce 20-bit Huffman code-length limit consistently (#699)
    
    * feat: enforce 20-bit Huffman code-length limit consistently
    
    This PR improves `BZip2CompressorInputStream` by enforcing the mandated 
Huffman code-length bound and right-sizing related tables. It aligns behavior 
with the reference C implementation’s intent while avoiding its historical 
over-allocation.
    
    ## Why
    
    The reference C code deliberately **over-sizes** its Huffman structures as 
a defensive programming technique: e.g., it defines `BZ_MAX_CODE_LEN = 23` and 
sizes length-indexed tables by `BZ_MAX_ALPHA_SIZE` rather than the code-length 
bound to provide safety margins. However, the bzip2 bitstream only permits code 
lengths ≤ 20 (and the 1.0.6 implementation effectively tightens this to 17), so 
carrying those defensive constants forward in our code obscures the true limit 
and keeps arrays  [...]
    
    ## What’s changed
    
    * **Set the true limit**: `MAX_CODE_LEN` now equals **20** (the format 
maximum).
    
    * **Validate inputs**: code lengths are checked explicitly; any value 
**outside `[1, 20]`** triggers a clear exception early in decoding.
    
    * **Right-size arrays**: Huffman tables and auxiliary arrays are sized to 
the minimum required for the 20-bit limit, reducing footprint and clarifying 
invariants.
    
    * **Tests**: add a unit test that exercises the maximum alphabet size 
across boundary code lengths.
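    
    As a hedged, illustrative sketch (the constant name mirrors the patch, but this is not the exact production code), the enforced check amounts to:

```java
// Illustrative sketch of the [1, 20] validation this change enforces.
// MAX_CODE_LEN mirrors the constant in BZip2Constants; the helper is hypothetical.
public class CodeLengthCheck {

    /** Maximum Huffman code length permitted by the bzip2 bitstream. */
    static final int MAX_CODE_LEN = 20;

    /** Returns true if {@code len} is a legal bzip2 Huffman code length. */
    static boolean isValidCodeLength(final int len) {
        return len >= 1 && len <= MAX_CODE_LEN;
    }
}
```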
    
    ## Maintenance & performance
    
    * Clearer invariants (the limit you see is the limit you enforce).
    * Slightly lower memory usage due to smaller tables.
    * Easier reasoning about bounds and indexing during review and future 
changes.
    
    ## References
    
    * 
[bzip2/huffman.c](https://github.com/libarchive/bzip2/blob/master/huffman.c)
    * 
[bzip2/bzlib_private.h](https://github.com/libarchive/bzip2/blob/master/bzlib_private.h)
      (Reference implementation uses over-sized constants for safety but 
enforces an effective 20-bit maximum.)
    
    * feat: add coverage for `recvDecodingTables` code length validation
    
    Ensure that `recvDecodingTables` rejects Huffman code lengths outside the 
valid range [1, 20]. This prevents invalid input from causing array overflows 
in `Data` during `createHuffmanDecodingTables`.
    
    The new test verifies both rejection of out-of-range values and correct 
acceptance of boundary values.
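    
    For context, bzip2 delta-codes each table's lengths: a 5-bit starting value followed by unary ±1 adjustments, so a corrupt stream can walk the value out of range. A hedged sketch of that decoding loop with the new check (the helper and its array-based input are illustrative; the real code reads deltas from a `BitInputStream`):

```java
import java.io.IOException;

public class DeltaLengthSketch {

    static final int MAX_CODE_LEN = 20;

    /**
     * Applies bzip2-style +1/-1 deltas to a starting length, rejecting any
     * intermediate value outside [1, MAX_CODE_LEN]. Illustrative only.
     */
    static int[] decodeLengths(final int startLen, final int[] deltas) throws IOException {
        final int[] lengths = new int[deltas.length + 1];
        int curr = startLen;
        for (int i = 0; i < lengths.length; i++) {
            if (i > 0) {
                curr += deltas[i - 1];
            }
            if (curr < 1 || curr > MAX_CODE_LEN) {
                throw new IOException(
                        "Corrupted input, code length value out of range [1, 20]: " + curr);
            }
            lengths[i] = curr;
        }
        return lengths;
    }
}
```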
    
    * fix: simplify computation of `minLen` and `maxLen`
    
    * fix: comment the construction of Huffman tables
    
    * fix: validate indexes against actual array length
    
    Indexes were previously validated against theoretical maximum limits 
instead of the actual array size. This made the code harder to reason about and 
was likely inherited from C conventions, where array length is not directly 
available. Using the real length simplifies validation and avoids potential 
errors.
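    
    A minimal sketch of the difference (the helper mirrors the shape of the stream's private `checkBounds`; exact messages and call sites differ):

```java
import java.io.IOException;

public class BoundsSketch {

    /** Mirrors the shape of BZip2CompressorInputStream's private checkBounds helper. */
    static void checkBounds(final int checkVal, final int limitExclusive, final String name)
            throws IOException {
        if (checkVal < 0 || checkVal >= limitExclusive) {
            throw new IOException("Corrupted input, " + name + " out of bounds: " + checkVal);
        }
    }

    static int firstEntry(final int zt, final int[][] base) throws IOException {
        // Before: checkBounds(zt, N_GROUPS, "zt")  -- a theoretical constant.
        // After: validate against what the array can actually hold.
        checkBounds(zt, base.length, "zt");
        return base[zt][0];
    }
}
```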
    
    * fix: failing tests
    
    * fix: limit clean-up to used range
    
    Co-authored-by: Copilot <[email protected]>
    
    * fix: initialize `inUseCount` to minimal value
    
    Initializes `inUseCount` to its minimal value to improve the clarity of the 
code.
    
    Co-authored-by: Copilot <[email protected]>
    
    * fix: remove `assertAll`
    
    * fix: improve changelog entries
    
    ---------
    
    Co-authored-by: Copilot <[email protected]>
---
 src/changes/changes.xml                            |   4 +-
 .../bzip2/BZip2CompressorInputStream.java          | 138 ++++++++++++---------
 .../compress/compressors/bzip2/BZip2Constants.java |  14 ++-
 .../bzip2/BZip2CompressorInputStreamTest.java      | 107 ++++++++++++++++
 4 files changed, 202 insertions(+), 61 deletions(-)

diff --git a/src/changes/changes.xml b/src/changes/changes.xml
index dca1fd89b..3039736e1 100644
--- a/src/changes/changes.xml
+++ b/src/changes/changes.xml
@@ -58,8 +58,8 @@ The <action> type attribute can be add,update,fix,remove.
       <action type="fix" dev="ggregory" due-to="Gary Gregory">Don't loose 
precision while reading folders from a SevenZFile.</action>
       <action type="fix" dev="ggregory" due-to="Roel van Dijk, Gary 
Gregory">Improve some exception messages in TarUtils and 
TarArchiveEntry.</action>
       <!-- FIX bzip2 -->      
-      <action type="fix" dev="ggregory" due-to="Tyler Nighswander, Gary 
Gregory">org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
 constructors now throws CompressorException instead of 
ArrayIndexOutOfBoundsException.</action>
-      <action type="fix" dev="ggregory" due-to="Gary 
Gregory">org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
 throws CompressorException (an IOException subclass) instead of IOException 
when it encounters formatting problems.</action>
+      <action type="fix" dev="ggregory" due-to="Tyler Nighswander, Gary 
Gregory">BZip2 input streams now throw CompressorException (a subclass of 
IOException) for invalid or corrupted data, providing more specific error 
reporting.</action>
+      <action type="fix" dev="pkarwasz" due-to="Tyler Nighswander, Piotr P. 
Karwasz">BZip2 input streams treat Huffman codes longer than 20 bits as 
corrupted data, matching the behavior of the reference implementation.</action>
       <!-- FIX dump -->      
       <action type="fix" dev="pkarwasz" due-to="Tyler Nighswander">Align DUMP 
archive block size with Linux `dump` utility.</action>
       <action type="fix" dev="ggregory" due-to="Tyler Nighswander, Gary 
Gregory">org.apache.commons.compress.archivers.dump.DumpArchiveInputStream.getNextEntry()
 throws an ArchiveException instead of ArrayIndexOutOfBoundsException.</action>
diff --git 
a/src/main/java/org/apache/commons/compress/compressors/bzip2/BZip2CompressorInputStream.java
 
b/src/main/java/org/apache/commons/compress/compressors/bzip2/BZip2CompressorInputStream.java
index ea21b3f80..9ea56c588 100644
--- 
a/src/main/java/org/apache/commons/compress/compressors/bzip2/BZip2CompressorInputStream.java
+++ 
b/src/main/java/org/apache/commons/compress/compressors/bzip2/BZip2CompressorInputStream.java
@@ -42,10 +42,13 @@
  */
 public class BZip2CompressorInputStream extends CompressorInputStream 
implements BZip2Constants, InputStreamStatistics {
 
-    private static final class Data {
+    // package private for testing
+    static final class Data {
 
         // (with blockSize 900k)
         final boolean[] inUse = new boolean[256]; // 256 byte
+        // Always equal to the number of true values in inUse[] plus 2.
+        private int inUseCount = 2;
 
         final byte[] seqToUnseq = new byte[256]; // 256 byte
         final byte[] selector = new byte[MAX_SELECTORS]; // 18002 byte
@@ -56,8 +59,10 @@ private static final class Data {
          */
         final int[] unzftab = new int[256]; // 1024 byte
 
-        final int[][] limit = new int[N_GROUPS][MAX_ALPHA_SIZE]; // 6192 byte
-        final int[][] base = new int[N_GROUPS][MAX_ALPHA_SIZE]; // 6192 byte
+        // Needs indexes from 0 to MAX_CODE_LEN inclusive.
+        final int[][] limit = new int[N_GROUPS][MAX_CODE_LEN + 1];
+        // Needs indexes from 0 to MAX_CODE_LEN + 1 inclusive.
+        final int[][] base = new int[N_GROUPS][MAX_CODE_LEN + 2];
         final int[][] perm = new int[N_GROUPS][MAX_ALPHA_SIZE]; // 6192 byte
         final int[] minLens = new int[N_GROUPS]; // 24 byte
 
@@ -155,9 +160,12 @@ private static void checkBounds(final int checkVal, final 
int limitExclusive, fi
 
     /**
      * Called by createHuffmanDecodingTables() exclusively.
+     *
+     * @param minLen minimum code length in the range [1, {@value 
MAX_CODE_LEN}] guaranteed by the caller.
+     * @param maxLen maximum code length in the range [1, {@value 
MAX_CODE_LEN}] guaranteed by the caller.
      */
     private static void hbCreateDecodeTables(final int[] limit, final int[] 
base, final int[] perm, final char[] length, final int minLen, final int maxLen,
-            final int alphaSize) throws IOException {
+            final int alphaSize) {
         for (int i = minLen, pp = 0; i <= maxLen; i++) {
             for (int j = 0; j < alphaSize; j++) {
                 if (length[j] == i) {
@@ -165,28 +173,33 @@ private static void hbCreateDecodeTables(final int[] 
limit, final int[] base, fi
                 }
             }
         }
-        for (int i = MAX_CODE_LEN; --i > 0;) {
-            base[i] = 0;
-            limit[i] = 0;
-        }
+        // Ensure the arrays were not reused.
+        Arrays.fill(base, 0);
+        Arrays.fill(limit, minLen, maxLen + 1, 0);
+        // Compute histogram of code lengths, shifted by 1.
         for (int i = 0; i < alphaSize; i++) {
             final int len = length[i] + 1;
-            checkBounds(len, MAX_ALPHA_SIZE, "length");
             base[len]++;
         }
-        for (int i = 1, b = base[0]; i < MAX_CODE_LEN; i++) {
-            b += base[i];
-            base[i] = b;
-        }
-        for (int i = minLen, vec = 0, b = base[i]; i <= maxLen; i++) {
-            final int nb = base[i + 1];
-            vec += nb - b;
-            b = nb;
-            limit[i] = vec - 1;
+        // Compute cumulative counts: base[len] = # of codes with length < len.
+        // In other terms: base[len] = index of the first code in the `perm` 
table.
+        for (int len = 1; len < base.length; len++) {
+            base[len] += base[len - 1];
+        }
+        // Compute the last code for each length.
+        int vec = 0;
+        for (int len = minLen; len <= maxLen; len++) {
+            // increment by the number of length `len` codes
+            vec += base[len + 1] - base[len];
+            // vec is now the last code of length `len` + 1
+            limit[len] = vec - 1;
             vec <<= 1;
         }
-        for (int i = minLen + 1; i <= maxLen; i++) {
-            base[i] = (limit[i - 1] + 1 << 1) - base[i];
+        // Compute the bias between code value and table index.
+        // base[minLen] cannot be computed using this rule, since limit[minLen 
- 1] does not exist,
+        // but has already the correct value 0.
+        for (int len = minLen + 1; len <= maxLen; len++) {
+            base[len] = (limit[len - 1] + 1 << 1) - base[len];
         }
     }
 
@@ -220,7 +233,6 @@ public static boolean matches(final byte[] signature, final 
int length) {
 
     private boolean blockRandomised;
     private final CRC crc = new CRC();
-    private int nInUse;
     private BitInputStream bin;
     private final boolean decompressConcatenated;
     private int currentState = START_BLOCK_STATE;
@@ -296,10 +308,15 @@ private boolean complete() throws IOException {
     }
 
     /**
-     * Called by recvDecodingTables() exclusively.
+     * Builds the Huffman decoding tables for use by {@code 
recvDecodingTables()}.
+     *
+     * @param alphaSize the alphabet size, guaranteed by the caller to be in 
the range [2, 258]
+     *                  (RUNA, RUNB, 255 byte values, and EOB).
+     * @param nGroups   the number of Huffman coding groups, guaranteed by the 
caller to be in the range [0, 6].
+     * @param dataShadow the data structure into which the tables are built; 
requires
+     *                   {@code temp_charArray2d} to be initialized.
      */
-    private void createHuffmanDecodingTables(final int alphaSize, final int 
nGroups) throws IOException {
-        final Data dataShadow = this.data;
+    static void createHuffmanDecodingTables(final int alphaSize, final int 
nGroups, final Data dataShadow) {
         final char[][] len = dataShadow.temp_charArray2d;
         final int[] minLens = dataShadow.minLens;
         final int[][] limit = dataShadow.limit;
@@ -307,10 +324,10 @@ private void createHuffmanDecodingTables(final int 
alphaSize, final int nGroups)
         final int[][] perm = dataShadow.perm;
 
         for (int t = 0; t < nGroups; t++) {
-            int minLen = 32;
-            int maxLen = 0;
             final char[] len_t = len[t];
-            for (int i = alphaSize; --i >= 0;) {
+            int minLen = len_t[0];
+            int maxLen = len_t[0];
+            for (int i = 1; i < alphaSize; i++) {
                 final char lent = len_t[i];
                 if (lent > maxLen) {
                     maxLen = lent;
@@ -341,8 +358,8 @@ private void endBlock() throws IOException {
     private void getAndMoveToFrontDecode() throws IOException {
         final BitInputStream bin = this.bin;
         this.origPtr = bsR(bin, 24);
-        recvDecodingTables();
         final Data dataShadow = this.data;
+        recvDecodingTables(bin, dataShadow);
         final byte[] ll8 = dataShadow.ll8;
         final int[] unzftab = dataShadow.unzftab;
         final byte[] selector = dataShadow.selector;
@@ -363,11 +380,12 @@ private void getAndMoveToFrontDecode() throws IOException 
{
         }
         int groupNo = 0;
         int groupPos = G_SIZE - 1;
-        final int eob = this.nInUse + 1;
+        final int eob = dataShadow.inUseCount + 1;
         int nextSym = getAndMoveToFrontDecode0();
         int lastShadow = -1;
         int zt = selector[groupNo] & 0xff;
-        checkBounds(zt, N_GROUPS, "zt");
+        // All arrays have the same length
+        checkBounds(zt, base.length, "zt");
         int[] base_zt = base[zt];
         int[] limit_zt = limit[zt];
         int[] perm_zt = perm[zt];
@@ -385,9 +403,10 @@ private void getAndMoveToFrontDecode() throws IOException {
                     }
                     if (groupPos == 0) {
                         groupPos = G_SIZE - 1;
-                        checkBounds(++groupNo, MAX_SELECTORS, "groupNo");
+                        checkBounds(++groupNo, selector.length, "groupNo");
                         zt = selector[groupNo] & 0xff;
-                        checkBounds(zt, N_GROUPS, "zt");
+                        // All arrays have the same length
+                        checkBounds(zt, base.length, "zt");
                         base_zt = base[zt];
                         limit_zt = limit[zt];
                         perm_zt = perm[zt];
@@ -396,19 +415,19 @@ private void getAndMoveToFrontDecode() throws IOException 
{
                         groupPos--;
                     }
                     int zn = minLens_zt;
-                    checkBounds(zn, MAX_ALPHA_SIZE, "zn");
+                    checkBounds(zn, limit_zt.length, "zn");
                     int zvec = bsR(bin, zn);
                     while (zvec > limit_zt[zn]) {
-                        checkBounds(++zn, MAX_ALPHA_SIZE, "zn");
+                        checkBounds(++zn, limit_zt.length, "zn");
                         zvec = zvec << 1 | bsR(bin, 1);
                     }
                     final int tmp = zvec - base_zt[zn];
-                    checkBounds(tmp, MAX_ALPHA_SIZE, "zvec");
+                    checkBounds(tmp, perm_zt.length, "zvec");
                     nextSym = perm_zt[tmp];
                 }
                 checkBounds(s, this.data.ll8.length, "s");
                 final int yy0 = yy[0];
-                checkBounds(yy0, 256, "yy");
+                checkBounds(yy0, seqToUnseq.length, "yy");
                 final byte ch = seqToUnseq[yy0];
                 unzftab[ch & 0xff] += s + 1;
                 final int from = ++lastShadow;
@@ -422,9 +441,9 @@ private void getAndMoveToFrontDecode() throws IOException {
                 if (++lastShadow >= limitLast) {
                     throw new CompressorException("Block overrun in MTF, %,d 
exceeds %,d", lastShadow, limitLast);
                 }
-                checkBounds(nextSym, 256 + 1, "nextSym");
+                checkBounds(nextSym - 1, yy.length, "nextSym");
                 final char tmp = yy[nextSym - 1];
-                checkBounds(tmp, 256, "yy");
+                checkBounds(tmp, seqToUnseq.length, "yy");
                 unzftab[seqToUnseq[tmp] & 0xff]++;
                 ll8[lastShadow] = seqToUnseq[tmp];
                 /*
@@ -440,9 +459,10 @@ private void getAndMoveToFrontDecode() throws IOException {
                 yy[0] = tmp;
                 if (groupPos == 0) {
                     groupPos = G_SIZE - 1;
-                    checkBounds(++groupNo, MAX_SELECTORS, "groupNo");
+                    checkBounds(++groupNo, selector.length, "groupNo");
                     zt = selector[groupNo] & 0xff;
-                    checkBounds(zt, N_GROUPS, "zt");
+                    // All arrays have the same length
+                    checkBounds(zt, base.length, "zt");
                     base_zt = base[zt];
                     limit_zt = limit[zt];
                     perm_zt = perm[zt];
@@ -451,14 +471,14 @@ private void getAndMoveToFrontDecode() throws IOException 
{
                     groupPos--;
                 }
                 int zn = minLens_zt;
-                checkBounds(zn, MAX_ALPHA_SIZE, "zn");
+                checkBounds(zn, limit_zt.length, "zn");
                 int zvec = bsR(bin, zn);
                 while (zvec > limit_zt[zn]) {
-                    checkBounds(++zn, MAX_ALPHA_SIZE, "zn");
+                    checkBounds(++zn, limit_zt.length, "zn");
                     zvec = zvec << 1 | bsR(bin, 1);
                 }
                 final int idx = zvec - base_zt[zn];
-                checkBounds(idx, MAX_ALPHA_SIZE, "zvec");
+                checkBounds(idx, perm_zt.length, "zvec");
                 nextSym = perm_zt[idx];
             }
         }
@@ -468,17 +488,17 @@ private void getAndMoveToFrontDecode() throws IOException 
{
     private int getAndMoveToFrontDecode0() throws IOException {
         final Data dataShadow = this.data;
         final int zt = dataShadow.selector[0] & 0xff;
-        checkBounds(zt, N_GROUPS, "zt");
+        checkBounds(zt, dataShadow.limit.length, "zt");
         final int[] limit_zt = dataShadow.limit[zt];
         int zn = dataShadow.minLens[zt];
-        checkBounds(zn, MAX_ALPHA_SIZE, "zn");
+        checkBounds(zn, limit_zt.length, "zn");
         int zvec = bsR(bin, zn);
         while (zvec > limit_zt[zn]) {
-            checkBounds(++zn, MAX_ALPHA_SIZE, "zn");
+            checkBounds(++zn, limit_zt.length, "zn");
             zvec = zvec << 1 | bsR(bin, 1);
         }
         final int tmp = zvec - dataShadow.base[zt][zn];
-        checkBounds(tmp, MAX_ALPHA_SIZE, "zvec");
+        checkBounds(tmp, dataShadow.perm[zt].length, "zvec");
         return dataShadow.perm[zt][tmp];
     }
 
@@ -573,9 +593,9 @@ private void initBlock() throws IOException {
         this.currentState = START_BLOCK_STATE;
     }
 
-    private void makeMaps() {
-        final boolean[] inUse = this.data.inUse;
-        final byte[] seqToUnseq = this.data.seqToUnseq;
+    private static void makeMaps(final Data data) throws IOException {
+        final boolean[] inUse = data.inUse;
+        final byte[] seqToUnseq = data.seqToUnseq;
 
         int nInUseShadow = 0;
 
@@ -585,7 +605,7 @@ private void makeMaps() {
             }
         }
 
-        this.nInUse = nInUseShadow;
+        data.inUseCount = nInUseShadow;
     }
 
     @Override
@@ -668,9 +688,7 @@ private int readNextByte(final BitInputStream in) throws 
IOException {
         return (int) b;
     }
 
-    private void recvDecodingTables() throws IOException {
-        final BitInputStream bin = this.bin;
-        final Data dataShadow = this.data;
+    static void recvDecodingTables(final BitInputStream bin, final Data 
dataShadow) throws IOException {
         final boolean[] inUse = dataShadow.inUse;
         final byte[] pos = dataShadow.recvDecodingTables_pos;
         final byte[] selector = dataShadow.selector;
@@ -697,8 +715,8 @@ private void recvDecodingTables() throws IOException {
             }
         }
 
-        makeMaps();
-        final int alphaSize = this.nInUse + 2;
+        makeMaps(dataShadow);
+        final int alphaSize = dataShadow.inUseCount + 2;
         /* Now the selectors */
         final int nGroups = bsR(bin, 3);
         final int selectors = bsR(bin, 15);
@@ -751,12 +769,17 @@ private void recvDecodingTables() throws IOException {
                 while (bsGetBit(bin)) {
                     curr += bsGetBit(bin) ? -1 : 1;
                 }
+                // Same condition as in bzip2
+                if (curr < 1 || curr > MAX_CODE_LEN) {
+                    throw new CompressorException(
+                            "Corrupted input, code length value out of range 
[%d, %d]: %d", 1, MAX_CODE_LEN, curr);
+                }
                 len_t[i] = (char) curr;
             }
         }
 
         // finally create the Huffman tables
-        createHuffmanDecodingTables(alphaSize, nGroups);
+        createHuffmanDecodingTables(alphaSize, nGroups, dataShadow);
     }
 
     private int setupBlock() throws IOException {
@@ -766,6 +789,7 @@ private int setupBlock() throws IOException {
 
         final int[] cftab = this.data.cftab;
         final int ttLen = this.last + 1;
+        // tt has size at least ttLen
         final int[] tt = this.data.initTT(ttLen);
         final byte[] ll8 = this.data.ll8;
         cftab[0] = 0;
diff --git 
a/src/main/java/org/apache/commons/compress/compressors/bzip2/BZip2Constants.java
 
b/src/main/java/org/apache/commons/compress/compressors/bzip2/BZip2Constants.java
index 9350d2100..592b0b527 100644
--- 
a/src/main/java/org/apache/commons/compress/compressors/bzip2/BZip2Constants.java
+++ 
b/src/main/java/org/apache/commons/compress/compressors/bzip2/BZip2Constants.java
@@ -35,9 +35,19 @@ interface BZip2Constants {
     int MAX_ALPHA_SIZE = 258;
 
     /**
-     * Constant {@value}.
+     * Maximum allowed length of a Huffman code in BZip2.
+     * <p>
+     *     In the original bzip2 C implementation, a defensive limit of 23 is 
set to prevent buffer overflows
+     *     (see <a 
href="https://github.com/libarchive/bzip2/blob/master/bzlib_private.h">{@code 
bzlib_private.h}</a>).
+     *     However, the actual enforced maximum during decompression is 20
+     *     (see <a 
href="https://github.com/libarchive/bzip2/blob/master/decompress.c">BZ2_decompress
 in {@code decompress.c}</a>),
+     *     and since version 1.0.6, compression uses a maximum of 17.
+     * </p>
+     * <p>
+     *     The Java implementation does not require the higher defensive 
limit, so this constant is set to the true maximum of 20.
+     * </p>
      */
-    int MAX_CODE_LEN = 23;
+    int MAX_CODE_LEN = 20;
 
     /**
      * Constant {@value}.
diff --git 
a/src/test/java/org/apache/commons/compress/compressors/bzip2/BZip2CompressorInputStreamTest.java
 
b/src/test/java/org/apache/commons/compress/compressors/bzip2/BZip2CompressorInputStreamTest.java
index 778832668..28fb2a831 100644
--- 
a/src/test/java/org/apache/commons/compress/compressors/bzip2/BZip2CompressorInputStreamTest.java
+++ 
b/src/test/java/org/apache/commons/compress/compressors/bzip2/BZip2CompressorInputStreamTest.java
@@ -18,7 +18,9 @@
  */
 package org.apache.commons.compress.compressors.bzip2;
 
+import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;
 import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
 import static org.junit.jupiter.api.Assertions.assertThrows;
 import static org.junit.jupiter.api.Assertions.assertTrue;
 
@@ -27,6 +29,7 @@
 import java.io.File;
 import java.io.IOException;
 import java.io.InputStream;
+import java.nio.ByteOrder;
 import java.nio.file.Files;
 import java.nio.file.Paths;
 
@@ -35,11 +38,18 @@
 import org.apache.commons.compress.archivers.ArchiveInputStream;
 import org.apache.commons.compress.archivers.ArchiveStreamFactory;
 import org.apache.commons.compress.compressors.CompressorException;
+import 
org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.Data;
+import org.apache.commons.compress.utils.BitInputStream;
 import org.apache.commons.io.IOUtils;
 import org.junit.jupiter.api.Test;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.ValueSource;
 
 class BZip2CompressorInputStreamTest extends AbstractTest {
 
+    private static final int MIN_CODE_LEN = 1;
+    private static final int MAX_CODE_LEN = 20;
+
     private void fuzzingTest(final int[] bytes) throws IOException, 
ArchiveException {
         final int len = bytes.length;
         final byte[] input = new byte[len];
@@ -149,4 +159,101 @@ void testSingleByteReadConsistentlyReturnsMinusOneAtEof() 
throws IOException {
             assertEquals(-1, in.read());
         }
     }
+
+    @Test
+    void testCreateHuffmanDecodingTablesWithLargeAlphaSize() {
+        final Data data = new Data(1);
+        // Use a codeLengths array with length equal to MAX_ALPHA_SIZE (258) 
to test array bounds.
+        final char[] codeLengths = new char[258];
+        for (int i = 0; i < codeLengths.length; i++) {
+            // Use all code lengths within valid range [1, 20]
+            codeLengths[i] = (char) ((i % MAX_CODE_LEN) + 1);
+        }
+        data.temp_charArray2d[0] = codeLengths;
+        assertDoesNotThrow(
+                () -> 
BZip2CompressorInputStream.createHuffmanDecodingTables(codeLengths.length, 1, 
data),
+                "createHuffmanDecodingTables should not throw for valid 
codeLengths array of MAX_ALPHA_SIZE");
+        assertEquals(data.minLens[0], 1, "Minimum code length should be 1");
+    }
+
+    @ParameterizedTest(name = "code length {0} -> must be rejected")
+    @ValueSource(ints = {MIN_CODE_LEN - 1, MAX_CODE_LEN + 1})
+    void testRecvDecodingTablesWithOutOfRangeCodeLength(final int codeLength) 
throws IOException {
+        try (BitInputStream tables = prepareDecodingTables(codeLength)) {
+            final Data data = new Data(1);
+
+            final CompressorException ex = assertThrows(
+                    CompressorException.class,
+                    () -> 
BZip2CompressorInputStream.recvDecodingTables(tables, data),
+                    "Expected CompressorException for invalid code length " + 
codeLength);
+
+            final String msg = ex.getMessage();
+            assertNotNull(msg, "Exception message must not be null");
+            assertTrue(msg.toLowerCase().contains("code length"), "Message 
should mention 'code length'");
+            assertTrue(
+                    msg.contains("[" + MIN_CODE_LEN + ", " + MAX_CODE_LEN + 
"]"),
+                    "Message should mention valid range [" + MIN_CODE_LEN + ", 
" + MAX_CODE_LEN + "]");
+            assertTrue(
+                    msg.contains(Integer.toString(codeLength)),
+                    "Message should include the offending value " + 
codeLength);
+        }
+    }
+
+    @ParameterizedTest(name = "code length {0} -> accepted and stored")
+    @ValueSource(ints = {MIN_CODE_LEN, MAX_CODE_LEN})
+    void testRecvDecodingTablesWithValidCodeLength(final int codeLength) 
throws IOException {
+        try (BitInputStream tables = prepareDecodingTables(codeLength)) {
+            final Data data = new Data(1);
+
+            assertDoesNotThrow(
+                    () -> 
BZip2CompressorInputStream.recvDecodingTables(tables, data),
+                    "Should accept code length " + codeLength + " within [" + 
MIN_CODE_LEN + ", " + MAX_CODE_LEN + "]");
+
+            // We encoded 2 Huffman groups; both minLens should equal the 
encoded codeLength
+            assertEquals(codeLength, data.minLens[0], "Group 0 min code length 
mismatch");
+            assertEquals(codeLength, data.minLens[1], "Group 1 min code length 
mismatch");
+        }
+    }
+
+    /**
+     * Builds a minimal bitstream for recvDecodingTables():
+     * <ul>
+     *     <li>Uses only one symbol 'A' (0x41).</li>
+     *     <li>Number of groups: 2 (minimum).</li>
+     *     <li>Number of selectors: 3.</li>
+     *     <li>Selectors: all three encode j=1 (unary "10").</li>
+     *     <li>Huffman code lengths for 2 groups over alphabet size 3 (RUNA, 
RUNB, EOB) are all equal to {@code codeLength}.</li>
+     * </ul>
+     * <p>
+     *     <strong>Note:</strong> The values are chosen to keep everything 
byte-aligned.
+     * </p>
+     * @param codeLength the code length to use for each symbol in each group; 
must be in [0, 31]
+     */
+    private BitInputStream prepareDecodingTables(final int codeLength) {
+        assertTrue(0 <= codeLength && codeLength <= 31, "codeLength must be 
between 0 and 31");
+
+        final ByteArrayOutputStream stream = new ByteArrayOutputStream();
+
+        // Upper nibbles in use: set only upper nibble 4
+        stream.write(0b0000_1000);
+        stream.write(0b0000_0000);
+
+        // Lower nibbles of upper nibble 4: set only lower nibble 1
+        stream.write(0b0100_0000);
+        stream.write(0b0000_0000);
+
+        // Groups/selectors (24 bits total, byte-aligned)
+        // nGroups = 2 (3 bits), nSelectors = 3 (15 bits), selectors: 10 10 10
+        stream.write(0b010_00000); // nGroups (3 bits) + high 5 bits of 
nSelectors
+        stream.write(0b00000000); // middle 8 bits of nSelectors
+        stream.write(0b11_10_10_10); // low 2 bits of nSelectors + selectors 
(3 x 2 bits)
+
+        // Huffman tables: two groups, three symbols each
+        // startLen (5 bits) followed by 3x '0' (done) => one byte: codeLength 
<< 3
+        stream.write(codeLength << 3);
+        stream.write(codeLength << 3);
+
+        return new BitInputStream(new 
ByteArrayInputStream(stream.toByteArray()), ByteOrder.BIG_ENDIAN);
+    }
+
 }
