[ 
https://issues.apache.org/jira/browse/TIKA-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086592#comment-18086592
 ] 

ASF GitHub Bot commented on TIKA-4745:
--------------------------------------

Copilot commented on code in PR #2878:
URL: https://github.com/apache/tika/pull/2878#discussion_r3367380860


##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/BigramTables.java:
##########
@@ -124,14 +122,18 @@ public void writeTo(DataOutputStream dos) throws IOException {
         cpBuf.asIntBuffer().put(codepointIndex);
         dos.write(cpBuf.array());
 
-        // Bigram open-addressing table (keys + values).
+        // Bigram table: sorted-occupied keys (ascending) + parallel values.
+        // Store key[0] raw, then varint (LEB128) deltas from the previous key;
+        // deltas are small because the keys are sorted and dense.
         dos.writeInt(bigramKeys.length);
         dos.writeFloat(bigramQuantMin);
         dos.writeFloat(bigramQuantMax);
-        ByteBuffer keyBuf = ByteBuffer.allocate(bigramKeys.length * 4)
-                .order(ByteOrder.BIG_ENDIAN);
-        keyBuf.asIntBuffer().put(bigramKeys);
-        dos.write(keyBuf.array());
+        if (bigramKeys.length > 0) {
+            dos.writeInt(bigramKeys[0]);
+            for (int i = 1; i < bigramKeys.length; i++) {
+                writeVarLong(dos, (long) bigramKeys[i] - (long) bigramKeys[i - 1]);
+            }

Review Comment:
   `writeVarLong` is documented as writing a non-negative value, but the delta 
computed from adjacent `bigramKeys` is never validated. If the key array is 
accidentally unsorted, a negative delta is encoded as a large unsigned varint 
and `readFrom` reconstructs corrupt keys. Add an explicit `delta >= 0` check so 
that `writeTo` fails fast.
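
   A minimal sketch of the suggested write-side check, for illustration only. 
The class `SortedKeyWriterSketch`, the methods `writeKeys` and `encode`, and 
the choice of `IllegalStateException` are assumptions, and the `writeVarLong` 
shown here is a plain unsigned LEB128 stand-in rather than the PR's actual 
helper.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch; not the actual BigramTables code.
final class SortedKeyWriterSketch {

    // Stand-in for the PR's writeVarLong: unsigned LEB128, 7 payload bits per byte.
    static void writeVarLong(DataOutputStream dos, long value) throws IOException {
        if (value < 0) {
            throw new IllegalArgumentException("varint value must be non-negative: " + value);
        }
        while ((value & ~0x7FL) != 0) {
            dos.writeByte((int) ((value & 0x7F) | 0x80)); // continuation bit set
            value >>>= 7;
        }
        dos.writeByte((int) value); // final byte, continuation bit clear
    }

    // Writes key[0] raw, then the delta from the previous key as a varint.
    // Rejects unsorted input before anything corrupt reaches the stream.
    static void writeKeys(DataOutputStream dos, int[] keys) throws IOException {
        if (keys.length == 0) {
            return;
        }
        dos.writeInt(keys[0]);
        for (int i = 1; i < keys.length; i++) {
            // Widen to long so the subtraction itself cannot overflow.
            long delta = (long) keys[i] - (long) keys[i - 1];
            if (delta < 0) {
                throw new IllegalStateException("keys must be sorted ascending: key[" + (i - 1)
                        + "]=" + keys[i - 1] + " > key[" + i + "]=" + keys[i]);
            }
            writeVarLong(dos, delta);
        }
    }

    // Convenience wrapper that returns the encoded bytes.
    static byte[] encode(int[] keys) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            writeKeys(dos, keys);
        }
        return bos.toByteArray();
    }
}
```

   With this check, an unsorted array such as `{5, 3}` is rejected at write 
time instead of producing a model file that only fails later on read.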



##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/BigramTables.java:
##########
@@ -153,9 +155,13 @@ public static BigramTables readFrom(DataInputStream dis) throws IOException {
         int slots = dis.readInt();
         float bMin = dis.readFloat();
         float bMax = dis.readFloat();
-        byte[] keyBytes = dis.readNBytes(slots * 4);
         int[] keys = new int[slots];
-        ByteBuffer.wrap(keyBytes).order(ByteOrder.BIG_ENDIAN).asIntBuffer().get(keys);
+        if (slots > 0) {
+            keys[0] = dis.readInt();
+            for (int i = 1; i < slots; i++) {
+                keys[i] = (int) (keys[i - 1] + readVarLong(dis));
+            }

Review Comment:
   When reconstructing keys from varint deltas, the code casts the running sum 
to `int` without validating bounds or monotonicity. A malformed/corrupt model 
could overflow and silently wrap, producing unsorted keys and incorrect 
binary-search results. Validate that the reconstructed key stays within the 
`int` range (and ideally remains non-decreasing) and throw `IOException` on 
violation.
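
   A minimal sketch of the suggested read-side validation, for illustration 
only. The class `SortedKeyReaderSketch` and the method `readKeys` are 
assumptions, and the `readVarLong` shown here is a plain unsigned LEB128 
stand-in rather than the PR's actual helper. The sum is checked in `long` 
arithmetic before the cast, so a corrupt delta raises `IOException` instead 
of wrapping.

```java
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical sketch; not the actual BigramTables code.
final class SortedKeyReaderSketch {

    // Stand-in for the PR's readVarLong: unsigned LEB128, at most 10 bytes.
    static long readVarLong(DataInputStream dis) throws IOException {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            int b = dis.readUnsignedByte();
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IOException("varint is longer than 10 bytes");
    }

    // Rebuilds the sorted keys from key[0] plus varint deltas, validating each step.
    static int[] readKeys(DataInputStream dis, int slots) throws IOException {
        if (slots < 0) {
            throw new IOException("negative slot count: " + slots);
        }
        int[] keys = new int[slots];
        if (slots > 0) {
            keys[0] = dis.readInt();
            for (int i = 1; i < slots; i++) {
                long delta = readVarLong(dis);
                // Largest delta that keeps the next key within int range.
                long headroom = (long) Integer.MAX_VALUE - keys[i - 1];
                if (delta < 0 || delta > headroom) {
                    throw new IOException("corrupt key delta at index " + i + ": " + delta);
                }
                // delta >= 0, so the keys are non-decreasing by construction.
                keys[i] = (int) (keys[i - 1] + delta);
            }
        }
        return keys;
    }
}
```

   Comparing the delta against the remaining headroom, rather than checking 
the sum after the addition, also covers deltas large enough to overflow the 
`long` addition itself.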





> Small improvements to lang detection, charset detection and junk detection
> --------------------------------------------------------------------------
>
>                 Key: TIKA-4745
>                 URL: https://issues.apache.org/jira/browse/TIKA-4745
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>             Fix For: 4.0.0
>
>
> I ran a regression test in prep for the 4.0.0-beta-1 release. There are a 
> number of smallish things that we can clean up in the components listed in 
> the title.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
