[jira] [Commented] (TIKA-4745) Small improvements to lang detection, charset detection and junk detection

ASF GitHub Bot (Jira) Sat, 06 Jun 2026 08:54:17 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086612#comment-18086612
 ]


ASF GitHub Bot commented on TIKA-4745:
--------------------------------------

Copilot commented on code in PR #2878:
URL: https://github.com/apache/tika/pull/2878#discussion_r3367764856


##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/BigramTables.java:
##########
@@ -124,14 +122,23 @@ public void writeTo(DataOutputStream dos) throws 
IOException {
         cpBuf.asIntBuffer().put(codepointIndex);
         dos.write(cpBuf.array());
 
-        // Bigram open-addressing table (keys + values).
+        // Bigram table: sorted-occupied keys (ascending) + parallel values.
+        // Store key[0] raw, then varint (LEB128) deltas from the previous key;
+        // deltas are small because the keys are sorted and dense.
         dos.writeInt(bigramKeys.length);
         dos.writeFloat(bigramQuantMin);
         dos.writeFloat(bigramQuantMax);
-        ByteBuffer keyBuf = ByteBuffer.allocate(bigramKeys.length * 4)
-                .order(ByteOrder.BIG_ENDIAN);
-        keyBuf.asIntBuffer().put(bigramKeys);
-        dos.write(keyBuf.array());
+        if (bigramKeys.length > 0) {
+            dos.writeInt(bigramKeys[0]);
+            for (int i = 1; i < bigramKeys.length; i++) {
+                long delta = (long) bigramKeys[i] - (long) bigramKeys[i - 1];
+                if (delta < 0) {
+                    throw new IOException("bigramKeys must be sorted 
ascending; "
+                            + "non-monotonic at index " + i);
+                }

Review Comment:
   `writeTo` allows equal adjacent keys (delta==0), but `readFrom` rejects 
non-strictly-increasing sequences (`next <= prev`). If duplicate keys ever slip 
in (e.g., from training bugs or corrupted models), the writer will happily 
serialize a delta of 0 and the reader will then fail to load the model. Align 
the invariants by enforcing strictly increasing keys at write time (or relax 
the reader).



##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -309,9 +308,27 @@ public NaiveBayesBigramEncodingDetector(InputStream 
modelStream) throws IOExcept
                 for (int bg = 0; bg < BIGRAM_SPACE; bg++) {
                     logP8[bg * numClasses + c] = u;
                 }
-                // Overwrite with trained pairs.
+                // Overwrite with trained pairs. Bigram ids are sorted 
ascending and
+                // stored as varint deltas (LEB128) from the previous id.
+                int bigram = 0;
                 for (int i = 0; i < vocabSize; i++) {
-                    int bigram = dis.readUnsignedShort();
+                    int delta = 0;
+                    int shift = 0;
+                    int b;
+                    do {
+                        if (shift >= 32) {
+                            throw new IOException(
+                                    "Malformed varint in bigram-id deltas (too 
long)");
+                        }
+                        b = dis.readUnsignedByte();
+                        delta |= (b & 0x7F) << shift;
+                        shift += 7;
+                    } while ((b & 0x80) != 0);
+                    bigram += delta;
+                    if (bigram < 0 || bigram >= BIGRAM_SPACE) {
+                        throw new IOException("Bigram id out of range: " + 
bigram
+                                + " (expected [0, " + BIGRAM_SPACE + "))");
+                    }

Review Comment:
   The varint delta decoder accumulates into an `int` via `delta |= (b & 0x7F) 
<< shift`. For malformed/corrupt model files, the last shift (28) can set the 
sign bit and overflow `int`, producing negative deltas and a confusing 
downstream "out of range" error. Decoding into a `long` and bounds-checking 
produces clearer failures and avoids intermediate overflow.



##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/TrainJunkModel.java:
##########
@@ -1037,12 +1037,8 @@ public static BigramTables buildBigramTablesFromCounts(
         // Quantize unigram log-probs.
         QuantizedFloats qUnigram = quantizeFloats(unigramLogP);
 
-        // -

> Small improvements to lang detection, charset detection and junk detection
> --------------------------------------------------------------------------
>
>                 Key: TIKA-4745
>                 URL: https://issues.apache.org/jira/browse/TIKA-4745
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>             Fix For: 4.0.0
>
>
> I ran a regression test in prep for the 4.0.0-beta-1 release. There are a 
> number of smallish things that we can clean up in the components listed in 
> the title.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (TIKA-4745) Small improvements to lang detection, charset detection and junk detection

Reply via email to