Copilot commented on code in PR #2878:
URL: https://github.com/apache/tika/pull/2878#discussion_r3367380860
##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/BigramTables.java:
##########
@@ -124,14 +122,18 @@ public void writeTo(DataOutputStream dos) throws IOException {
cpBuf.asIntBuffer().put(codepointIndex);
dos.write(cpBuf.array());
- // Bigram open-addressing table (keys + values).
+ // Bigram table: sorted-occupied keys (ascending) + parallel values.
+ // Store key[0] raw, then varint (LEB128) deltas from the previous key;
+ // deltas are small because the keys are sorted and dense.
dos.writeInt(bigramKeys.length);
dos.writeFloat(bigramQuantMin);
dos.writeFloat(bigramQuantMax);
- ByteBuffer keyBuf = ByteBuffer.allocate(bigramKeys.length * 4)
- .order(ByteOrder.BIG_ENDIAN);
- keyBuf.asIntBuffer().put(bigramKeys);
- dos.write(keyBuf.array());
+ if (bigramKeys.length > 0) {
+ dos.writeInt(bigramKeys[0]);
+ for (int i = 1; i < bigramKeys.length; i++) {
+ writeVarLong(dos, (long) bigramKeys[i] - (long) bigramKeys[i - 1]);
+ }
Review Comment:
`writeVarLong` is documented as writing a non-negative value, but the delta
computed from adjacent `bigramKeys` entries is never validated. If the key
array is accidentally unsorted, the negative delta is encoded as a large
unsigned varint and `readFrom` reconstructs corrupt keys. Add an explicit
`delta >= 0` check before the `writeVarLong` call so that writing fails fast.
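
A minimal sketch of the suggested write-side check follows. The PR's actual
`writeVarLong` is not shown in this hunk, so the version below is a hypothetical
unsigned LEB128 stand-in; the class name `DeltaKeyWriter`, the `writeKeys` and
`encode` helpers, and the choice of `IllegalStateException` are likewise
assumptions made for illustration, not the project's code.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class DeltaKeyWriter {

    /** Hypothetical stand-in for the PR's writeVarLong: unsigned LEB128, 7 bits per byte. */
    static void writeVarLong(DataOutputStream dos, long value) throws IOException {
        if (value < 0) {
            throw new IllegalArgumentException("varint value must be non-negative: " + value);
        }
        while ((value & ~0x7FL) != 0) {
            dos.writeByte((int) ((value & 0x7F) | 0x80)); // low 7 bits, continuation bit set
            value >>>= 7;
        }
        dos.writeByte((int) value); // final byte, continuation bit clear
    }

    /** Writes key[0] raw, then one validated non-negative delta per following key. */
    static void writeKeys(DataOutputStream dos, int[] keys) throws IOException {
        if (keys.length == 0) {
            return;
        }
        dos.writeInt(keys[0]);
        for (int i = 1; i < keys.length; i++) {
            // Widen to long first so the subtraction itself cannot overflow.
            long delta = (long) keys[i] - (long) keys[i - 1];
            if (delta < 0) {
                throw new IllegalStateException("keys not sorted at index " + i
                        + ": " + keys[i - 1] + " > " + keys[i]);
            }
            writeVarLong(dos, delta);
        }
    }

    /** Convenience wrapper (assumption, for experimentation): encode to a byte array. */
    static byte[] encode(int[] keys) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            writeKeys(dos, keys);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```

With this check, an unsorted array such as `{5, 3}` is rejected while writing
instead of producing a model file that only misbehaves when it is read back.
Whether a zero delta (a duplicate key) should also be rejected depends on
whether the table guarantees strictly ascending keys, which this hunk does not
state.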
##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/BigramTables.java:
##########
@@ -153,9 +155,13 @@ public static BigramTables readFrom(DataInputStream dis) throws IOException {
int slots = dis.readInt();
float bMin = dis.readFloat();
float bMax = dis.readFloat();
- byte[] keyBytes = dis.readNBytes(slots * 4);
int[] keys = new int[slots];
- ByteBuffer.wrap(keyBytes).order(ByteOrder.BIG_ENDIAN).asIntBuffer().get(keys);
+ if (slots > 0) {
+ keys[0] = dis.readInt();
+ for (int i = 1; i < slots; i++) {
+ keys[i] = (int) (keys[i - 1] + readVarLong(dis));
+ }
Review Comment:
When reconstructing keys from varint deltas, the code casts the running sum
to `int` without validating bounds or monotonicity. A malformed/corrupt model
could overflow and silently wrap, producing unsorted keys and incorrect
binary-search results. Validate that the reconstructed key stays within the
`int` range (and ideally remains non-decreasing) and throw `IOException` on
violation.
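
One way the read-side validation could look is sketched below. As with the
writer, `readVarLong` here is a hypothetical unsigned LEB128 stand-in for the
PR's method, and `DeltaKeyReader`, `readKeys`, and `decode` are illustrative
names that do not appear in the source. The sketch keeps the running sum in
`long` arithmetic and rejects any delta that is negative or would push the key
past `Integer.MAX_VALUE`, which also guarantees non-decreasing keys.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class DeltaKeyReader {

    /** Hypothetical stand-in for the PR's readVarLong: unsigned LEB128, at most 10 bytes. */
    static long readVarLong(DataInputStream dis) throws IOException {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            int b = dis.readUnsignedByte();
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IOException("varint is longer than 10 bytes");
    }

    /** Reads key[0] raw, then rebuilds the remaining keys from validated deltas. */
    static int[] readKeys(DataInputStream dis, int slots) throws IOException {
        if (slots < 0) {
            throw new IOException("negative slot count: " + slots);
        }
        int[] keys = new int[slots];
        if (slots == 0) {
            return keys;
        }
        keys[0] = dis.readInt();
        for (int i = 1; i < slots; i++) {
            long delta = readVarLong(dis);
            // Remaining headroom before the key would exceed the int range.
            long headroom = (long) Integer.MAX_VALUE - keys[i - 1];
            if (delta < 0 || delta > headroom) {
                throw new IOException("corrupt key delta " + delta + " at index " + i);
            }
            keys[i] = (int) (keys[i - 1] + delta); // safe: range checked above
        }
        return keys;
    }

    /** Convenience wrapper (assumption, for experimentation): decode from a byte array. */
    static int[] decode(byte[] bytes, int slots) {
        try (DataInputStream dis = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return readKeys(dis, slots);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Comparing the delta against the remaining headroom, rather than adding first
and checking afterwards, avoids a second overflow in the `long` sum when a
corrupt stream supplies a delta close to `Long.MAX_VALUE`.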