Copilot commented on code in PR #2878:
URL: https://github.com/apache/tika/pull/2878#discussion_r3367764856


##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/BigramTables.java:
##########
@@ -124,14 +122,23 @@ public void writeTo(DataOutputStream dos) throws 
IOException {
         cpBuf.asIntBuffer().put(codepointIndex);
         dos.write(cpBuf.array());
 
-        // Bigram open-addressing table (keys + values).
+        // Bigram table: sorted-occupied keys (ascending) + parallel values.
+        // Store key[0] raw, then varint (LEB128) deltas from the previous key;
+        // deltas are small because the keys are sorted and dense.
         dos.writeInt(bigramKeys.length);
         dos.writeFloat(bigramQuantMin);
         dos.writeFloat(bigramQuantMax);
-        ByteBuffer keyBuf = ByteBuffer.allocate(bigramKeys.length * 4)
-                .order(ByteOrder.BIG_ENDIAN);
-        keyBuf.asIntBuffer().put(bigramKeys);
-        dos.write(keyBuf.array());
+        if (bigramKeys.length > 0) {
+            dos.writeInt(bigramKeys[0]);
+            for (int i = 1; i < bigramKeys.length; i++) {
+                long delta = (long) bigramKeys[i] - (long) bigramKeys[i - 1];
+                if (delta < 0) {
+                    throw new IOException("bigramKeys must be sorted 
ascending; "
+                            + "non-monotonic at index " + i);
+                }

Review Comment:
   `writeTo` allows equal adjacent keys (delta==0), but `readFrom` rejects 
non-strictly-increasing sequences (`next <= prev`). If duplicate keys ever slip 
in (e.g., from training bugs or corrupted models), the writer will happily 
serialize a delta of 0 and the reader will then fail to load the model. Align 
the invariants by enforcing strictly increasing keys at write time (or relax 
the reader).



##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -309,9 +308,27 @@ public NaiveBayesBigramEncodingDetector(InputStream 
modelStream) throws IOExcept
                 for (int bg = 0; bg < BIGRAM_SPACE; bg++) {
                     logP8[bg * numClasses + c] = u;
                 }
-                // Overwrite with trained pairs.
+                // Overwrite with trained pairs. Bigram ids are sorted 
ascending and
+                // stored as varint deltas (LEB128) from the previous id.
+                int bigram = 0;
                 for (int i = 0; i < vocabSize; i++) {
-                    int bigram = dis.readUnsignedShort();
+                    int delta = 0;
+                    int shift = 0;
+                    int b;
+                    do {
+                        if (shift >= 32) {
+                            throw new IOException(
+                                    "Malformed varint in bigram-id deltas (too 
long)");
+                        }
+                        b = dis.readUnsignedByte();
+                        delta |= (b & 0x7F) << shift;
+                        shift += 7;
+                    } while ((b & 0x80) != 0);
+                    bigram += delta;
+                    if (bigram < 0 || bigram >= BIGRAM_SPACE) {
+                        throw new IOException("Bigram id out of range: " + 
bigram
+                                + " (expected [0, " + BIGRAM_SPACE + "))");
+                    }

Review Comment:
   The varint delta decoder accumulates into an `int` via `delta |= (b & 0x7F) 
<< shift`. For malformed/corrupt model files, the last shift (28) can set the 
sign bit and overflow `int`, producing negative deltas and a confusing 
downstream "out of range" error. Decoding into a `long` and bounds-checking 
produces clearer failures and avoids intermediate overflow.



##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/TrainJunkModel.java:
##########
@@ -1037,12 +1037,8 @@ public static BigramTables buildBigramTablesFromCounts(
         // Quantize unigram log-probs.
         QuantizedFloats qUnigram = quantizeFloats(unigramLogP);
 
-        // --- Build the open-addressing bigram table. ---
-        int slots = nextPowerOfTwo((int) Math.max(2, Math.ceil(keptPairs / 
loadFactor)));
-        int[] keys = new int[slots];
-        java.util.Arrays.fill(keys, BigramTables.EMPTY_KEY);
-        // Compute log-probs first, quantize once, then write into the table
-        // alongside its key.
+        // --- Build the sorted-occupied bigram table (binary-search lookup). 
---
+        // Compute log-probs first, quantize once, then sort by key.

Review Comment:
   This method now builds a sorted-occupied bigram table, but the `loadFactor` 
parameter (and the javadoc above) are still from the old open-addressing 
implementation and are no longer used. That makes the API misleading and can 
trigger unused-parameter warnings for callers/maintainers. Either remove 
`loadFactor` (if safe) or add an explicit note that it is retained for 
compatibility and ignored.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to