Re: [PR] TIKA-4745 - charset/junk/tika-eval improvements [tika]

via GitHub Wed, 03 Jun 2026 11:42:36 -0700


tballison commented on code in PR #2861:
URL: https://github.com/apache/tika/pull/2861#discussion_r3351020518



##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CjkDecodeValidator.java:
##########
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.ml.chardetect;
+
+import java.nio.ByteBuffer;
+import java.nio.CharBuffer;
+import java.nio.charset.Charset;
+import java.nio.charset.CharsetDecoder;
+import java.nio.charset.CoderResult;
+import java.nio.charset.CodingErrorAction;
+import java.util.Locale;
+
+import org.apache.tika.detect.CharsetSupersets;
+
+/**
+ * Structural false-CJK veto: measures how badly a probe fails to decode under a
+ * legacy multi-byte CJK charset, robustly against embedded UTF-8.
+ *
+ * <p>A Latin/Cyrillic/garbage page mis-detected as a legacy CJK charset decodes
+ * with many malformed/unmappable sequences; real CJK decodes cleanly.  Two
+ * corrections make the rate meaningful (see the findings doc):
+ * <ol>
+ *   <li>decode under the <em>vendor superset</em> ({@link CharsetSupersets}) so
+ *       real vendor-extension chars aren't counted as failures;</li>
+ *   <li><strong>discount embedded UTF-8</strong> — mixed-encoding pages (legacy
+ *       CJK body + UTF-8 widgets) would otherwise inflate the rate.  Post-discount,
+ *       real CJK (pure or mixed) is ≤1.6% while genuine false-CJK stays ≥5.3%.</li>
+ * </ol>
+ *
+ * <p>The discount is done by a <em>UTF-8-aware single pass</em>, NOT by physically
+ * stripping UTF-8 runs: a real legacy-CJK char can coincidentally match UTF-8
+ * grammar (e.g. Shift_JIS kanji with lead 0xE0–0xEA), and physically removing it
+ * would misalign the stream and manufacture failures on genuine CJK.  Instead we
+ * walk the bytes, skip positions that begin a valid UTF-8 sequence, and decode the
+ * legacy charset in place everywhere else — so real CJK is never misaligned and
+ * the rate errs toward <em>not</em> vetoing.
+ *
+ * <p>Does NOT catch the legal-but-wrong class (Latin bytes that form <em>valid</em>
+ * CJK at ~0 failure) — that's the typicality layer's job.
+ */
+public final class CjkDecodeValidator {
+
+    private CjkDecodeValidator() {
+    }
+
+    /** Minimum legacy (non-UTF-8) high bytes required before the rate is trusted. */
+    public static final int MIN_HIGH_BYTES = 30;
+
+    /**
+     * Failure rate of {@code bytes} under {@code cjkCharset}'s vendor superset,
+     * counting only legacy high bytes (embedded UTF-8 is skipped, not counted).
+     *
+     * @return failures / legacy-high-bytes, or {@code -1.0} when there is too
+     *         little legacy evidence (legacy high bytes &lt; {@link #MIN_HIGH_BYTES})
+     */
+    public static double strippedFailureRate(byte[] bytes, Charset cjkCharset) {
+        Charset decodeAs = CharsetSupersets.supersetOf(cjkCharset);
+        if (decodeAs == null) {
+            decodeAs = cjkCharset;
+        }
+        CharsetDecoder dec = decodeAs.newDecoder()
+                .onMalformedInput(CodingErrorAction.REPORT)
+                .onUnmappableCharacter(CodingErrorAction.REPORT);
+        CharBuffer one = CharBuffer.allocate(1);
+        int i = 0;
+        int n = bytes.length;
+        int fail = 0;
+        int nHigh = 0;
+        while (i < n) {
+            int x = bytes[i] & 0xFF;
+            if (x < 0x80) {
+                i++;
+                continue;
+            }
+            int ulen = utf8SequenceLength(bytes, i);
+            if (ulen > 0) {
+                i += ulen; // embedded UTF-8 — not legacy content, skip
+                continue;
+            }
+            nHigh++;
+            dec.reset();
+            one.clear();
+            ByteBuffer in = ByteBuffer.wrap(bytes, i, Math.min(4, n - i));
+            CoderResult r = dec.decode(in, one, true);
+            if (r.isError()) {
+                fail++;
+                i++;
+            } else {
+                int consumed = in.position() - i;
+                i += Math.max(1, consumed);
+            }

Review Comment:
   Claude: This one's a false alarm. ByteBuffer.wrap(byte[], offset, length) is
   the three-arg form: it returns a buffer whose position is offset, not 0
   (capacity = array.length, limit = offset + length). Only the single-arg
   wrap(byte[]) starts at position 0. So after decode, in.position() is
   i + bytesConsumed, and consumed = in.position() - i is correct.

   Verified directly with bytes = {…,0xC4,0xE3,…}, i = 3: the initial position
   is 3 (not 0); after decoding the 2-byte GB18030 sequence the position is 5, so
   in.position() - i = 2, which is correct. Using the raw in.position() as you
   suggest would yield 5 and over-advance i. Leaving as-is.
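
A minimal, self-contained sketch of the same check, for anyone who wants to reproduce it. The class name WrapPositionDemo and its helper methods are hypothetical and not part of the PR. UTF-8 stands in for GB18030 only because it is guaranteed to be present on every JVM; the byte values are illustrative, and the position semantics of the three-arg wrap do not depend on the charset.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the PR.
public class WrapPositionDemo {

    /** Position of a buffer freshly created by the three-arg wrap at offset i. */
    static int initialPosition(byte[] bytes, int i) {
        return ByteBuffer.wrap(bytes, i, Math.min(4, bytes.length - i)).position();
    }

    /**
     * Decodes a single char starting at offset i, mirroring the loop body in the
     * diff, and returns the number of bytes consumed (or -1 on a decode error).
     * UTF-8 is used here as a stand-in for the legacy CJK charset.
     */
    static int consumedAt(byte[] bytes, int i) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer one = CharBuffer.allocate(1);
        ByteBuffer in = ByteBuffer.wrap(bytes, i, Math.min(4, bytes.length - i));
        CoderResult r = dec.decode(in, one, true);
        if (r.isError()) {
            return -1;
        }
        // Position is absolute within the backing array, so subtract the offset.
        return in.position() - i;
    }

    public static void main(String[] args) {
        // Three ASCII bytes, one 2-byte sequence (U+00E9), one trailing ASCII byte.
        byte[] bytes = {0x41, 0x42, 0x43, (byte) 0xC3, (byte) 0xA9, 0x44};
        int i = 3;
        System.out.println("single-arg wrap position: " + ByteBuffer.wrap(bytes).position());
        System.out.println("three-arg wrap position:  " + initialPosition(bytes, i));
        System.out.println("bytes consumed at i:      " + consumedAt(bytes, i));
    }
}
```

With these bytes and i = 3, the three-arg buffer starts at position 3 and the helper reports 2 bytes consumed, matching the arithmetic above. The one-char output buffer causes the decoder to stop with an overflow after the first char, which is not an error result, so the position reflects exactly one decoded sequence.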
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] TIKA-4745 - charset/junk/tika-eval improvements [tika]