[
https://issues.apache.org/jira/browse/TIKA-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086592#comment-18086592
]
ASF GitHub Bot commented on TIKA-4745:
--------------------------------------
Copilot commented on code in PR #2878:
URL: https://github.com/apache/tika/pull/2878#discussion_r3367380860
##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/BigramTables.java:
##########
@@ -124,14 +122,18 @@ public void writeTo(DataOutputStream dos) throws IOException {
cpBuf.asIntBuffer().put(codepointIndex);
dos.write(cpBuf.array());
- // Bigram open-addressing table (keys + values).
+ // Bigram table: sorted-occupied keys (ascending) + parallel values.
+ // Store key[0] raw, then varint (LEB128) deltas from the previous key;
+ // deltas are small because the keys are sorted and dense.
dos.writeInt(bigramKeys.length);
dos.writeFloat(bigramQuantMin);
dos.writeFloat(bigramQuantMax);
- ByteBuffer keyBuf = ByteBuffer.allocate(bigramKeys.length * 4)
- .order(ByteOrder.BIG_ENDIAN);
- keyBuf.asIntBuffer().put(bigramKeys);
- dos.write(keyBuf.array());
+ if (bigramKeys.length > 0) {
+ dos.writeInt(bigramKeys[0]);
+ for (int i = 1; i < bigramKeys.length; i++) {
+ writeVarLong(dos, (long) bigramKeys[i] - (long) bigramKeys[i - 1]);
+ }
Review Comment:
`writeVarLong` is documented as writing a non-negative value, but the delta
computed from adjacent `bigramKeys` is never validated. If the key array is
accidentally unsorted, a negative delta will be encoded as a large unsigned
varint and `readFrom` will reconstruct corrupt keys. Add an explicit
`delta >= 0` check before each write so that an unsorted array fails fast.
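
A minimal sketch of the write-side check, under some assumptions: `SortedKeyWriter` and `writeKeys` are hypothetical names, the `writeVarLong` body here is a generic unsigned LEB128 stand-in (the PR's own implementation is not shown in this hunk), and `IllegalStateException` is just one reasonable choice of exception. The quantization floats written between the count and the keys are omitted for brevity.

```java
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical helper (not part of Tika): delta-encodes sorted int keys with a fail-fast check. */
final class SortedKeyWriter {

    /** Generic unsigned LEB128 writer; a stand-in for the PR's writeVarLong. */
    static void writeVarLong(DataOutputStream dos, long value) throws IOException {
        if (value < 0) {
            throw new IllegalArgumentException("varint value must be non-negative: " + value);
        }
        // Emit 7 bits at a time, setting the high bit while more bytes follow.
        while ((value & ~0x7FL) != 0) {
            dos.writeByte((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        dos.writeByte((int) value);
    }

    /** Writes the key count, key[0] raw, then validated non-negative deltas. */
    static void writeKeys(DataOutputStream dos, int[] keys) throws IOException {
        dos.writeInt(keys.length);
        if (keys.length == 0) {
            return;
        }
        dos.writeInt(keys[0]);
        for (int i = 1; i < keys.length; i++) {
            // Widen to long first so the subtraction itself cannot overflow.
            long delta = (long) keys[i] - (long) keys[i - 1];
            if (delta < 0) {
                // Assumption: an unsorted array is a programming error, so fail loudly.
                throw new IllegalStateException("keys not sorted at index " + i
                        + ": " + keys[i - 1] + " > " + keys[i]);
            }
            writeVarLong(dos, delta);
        }
    }
}
```

With this shape, an unsorted array is rejected at the first offending index instead of producing a model file that only misbehaves at read time.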
##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/BigramTables.java:
##########
@@ -153,9 +155,13 @@ public static BigramTables readFrom(DataInputStream dis) throws IOException {
int slots = dis.readInt();
float bMin = dis.readFloat();
float bMax = dis.readFloat();
- byte[] keyBytes = dis.readNBytes(slots * 4);
int[] keys = new int[slots];
- ByteBuffer.wrap(keyBytes).order(ByteOrder.BIG_ENDIAN).asIntBuffer().get(keys);
+ if (slots > 0) {
+ keys[0] = dis.readInt();
+ for (int i = 1; i < slots; i++) {
+ keys[i] = (int) (keys[i - 1] + readVarLong(dis));
+ }
Review Comment:
When reconstructing keys from varint deltas, the code casts the running sum
to `int` without validating bounds or monotonicity. A malformed or corrupt
model could overflow the `int` range and silently wrap, producing unsorted keys
and incorrect binary-search results. Validate that each reconstructed key stays
within the `int` range (and ideally that the sequence remains non-decreasing),
and throw `IOException` on violation.
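
One possible shape for the read-side validation, as a sketch only: `SortedKeyReader` and `readKeys` are hypothetical names, and `readVarLong` here is a generic unsigned LEB128 reader standing in for the PR's method, which is not shown in this hunk. The quantization floats between the count and the keys are again left out.

```java
import java.io.DataInputStream;
import java.io.IOException;

/** Hypothetical helper (not part of Tika): decodes delta-encoded keys with bounds checks. */
final class SortedKeyReader {

    /** Generic unsigned LEB128 reader; a stand-in for the PR's readVarLong. */
    static long readVarLong(DataInputStream dis) throws IOException {
        long result = 0;
        // A 64-bit value needs at most 10 groups of 7 bits.
        for (int shift = 0; shift < 64; shift += 7) {
            int b = dis.readUnsignedByte();
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IOException("varint is longer than 10 bytes");
    }

    /** Reads the key count, key[0] raw, then deltas, rejecting corrupt input. */
    static int[] readKeys(DataInputStream dis) throws IOException {
        int count = dis.readInt();
        if (count < 0) {
            throw new IOException("negative key count: " + count);
        }
        int[] keys = new int[count];
        if (count == 0) {
            return keys;
        }
        keys[0] = dis.readInt();
        for (int i = 1; i < count; i++) {
            long delta = readVarLong(dis);
            // Largest delta that keeps the next key within the int range.
            long maxDelta = (long) Integer.MAX_VALUE - keys[i - 1];
            // delta < 0 covers a varint with the top bit set; delta >= 0
            // also guarantees the keys stay non-decreasing.
            if (delta < 0 || delta > maxDelta) {
                throw new IOException("corrupt key delta " + delta + " at index " + i);
            }
            keys[i] = (int) (keys[i - 1] + delta);
        }
        return keys;
    }
}
```

Comparing the delta against `maxDelta` rather than checking the sum avoids relying on the `long` addition itself not overflowing when the decoded delta is very large.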
> Small improvements to lang detection, charset detection and junk detection
> --------------------------------------------------------------------------
>
> Key: TIKA-4745
> URL: https://issues.apache.org/jira/browse/TIKA-4745
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 4.0.0
>
>
> I ran a regression test in prep for the 4.0.0-beta-1 release. There are a
> number of smallish things that we can clean up in the components listed in
> the title.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)