tika-eval improvements [tika]

via GitHub Wed, 03 Jun 2026 10:35:54 -0700


Copilot commented on code in PR #2861:
URL: https://github.com/apache/tika/pull/2861#discussion_r3350627629



##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CjkDecodeValidator.java:
##########
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.ml.chardetect;
+
+import java.nio.ByteBuffer;
+import java.nio.CharBuffer;
+import java.nio.charset.Charset;
+import java.nio.charset.CharsetDecoder;
+import java.nio.charset.CoderResult;
+import java.nio.charset.CodingErrorAction;
+import java.util.Locale;
+
+import org.apache.tika.detect.CharsetSupersets;
+
+/**
+ * Structural false-CJK veto: measures how badly a probe fails to decode under 
a
+ * legacy multi-byte CJK charset, robustly against embedded UTF-8.
+ *
+ * <p>A Latin/Cyrillic/garbage page mis-detected as a legacy CJK charset 
decodes
+ * with many malformed/unmappable sequences; real CJK decodes cleanly.  Two
+ * corrections make the rate meaningful (see the findings doc):
+ * <ol>
+ *   <li>decode under the <em>vendor superset</em> ({@link CharsetSupersets}) 
so
+ *       real vendor-extension chars aren't counted as failures;</li>
+ *   <li><strong>discount embedded UTF-8</strong> — mixed-encoding pages 
(legacy
+ *       CJK body + UTF-8 widgets) would otherwise inflate the rate.  
Post-discount,
+ *       real CJK (pure or mixed) is ≤1.6% while genuine false-CJK stays 
≥5.3%.</li>
+ * </ol>
+ *
+ * <p>The discount is done by a <em>UTF-8-aware single pass</em>, NOT by 
physically
+ * stripping UTF-8 runs: a real legacy-CJK char can coincidentally match UTF-8
+ * grammar (e.g. Shift_JIS kanji with lead 0xE0–0xEA), and physically removing 
it
+ * would misalign the stream and manufacture failures on genuine CJK.  Instead 
we
+ * walk the bytes, skip positions that begin a valid UTF-8 sequence, and 
decode the
+ * legacy charset in place everywhere else — so real CJK is never misaligned 
and
+ * the rate errs toward <em>not</em> vetoing.
+ *
+ * <p>Does NOT catch the legal-but-wrong class (Latin bytes that form 
<em>valid</em>
+ * CJK at ~0 failure) — that's the typicality layer's job.
+ */
+public final class CjkDecodeValidator {
+
+    private CjkDecodeValidator() {
+    }
+
+    /** Minimum legacy (non-UTF-8) high bytes required before the rate is 
trusted. */
+    public static final int MIN_HIGH_BYTES = 30;
+
+    /**
+     * Failure rate of {@code bytes} under {@code cjkCharset}'s vendor 
superset,
+     * counting only legacy high bytes (embedded UTF-8 is skipped, not 
counted).
+     *
+     * @return failures / legacy-high-bytes, or {@code -1.0} when there is too
+     *         little legacy evidence (legacy high bytes &lt; {@link 
#MIN_HIGH_BYTES})
+     */
+    public static double strippedFailureRate(byte[] bytes, Charset cjkCharset) 
{
+        Charset decodeAs = CharsetSupersets.supersetOf(cjkCharset);
+        if (decodeAs == null) {
+            decodeAs = cjkCharset;
+        }
+        CharsetDecoder dec = decodeAs.newDecoder()
+                .onMalformedInput(CodingErrorAction.REPORT)
+                .onUnmappableCharacter(CodingErrorAction.REPORT);
+        CharBuffer one = CharBuffer.allocate(1);
+        int i = 0;
+        int n = bytes.length;
+        int fail = 0;
+        int nHigh = 0;
+        while (i < n) {
+            int x = bytes[i] & 0xFF;
+            if (x < 0x80) {
+                i++;
+                continue;
+            }
+            int ulen = utf8SequenceLength(bytes, i);
+            if (ulen > 0) {
+                i += ulen; // embedded UTF-8 — not legacy content, skip
+                continue;
+            }
+            nHigh++;
+            dec.reset();
+            one.clear();
+            ByteBuffer in = ByteBuffer.wrap(bytes, i, Math.min(4, n - i));
+            CoderResult r = dec.decode(in, one, true);
+            if (r.isError()) {
+                fail++;
+                i++;
+            } else {
+                int consumed = in.position() - i;
+                i += Math.max(1, consumed);
+            }

Review Comment:
   `consumed` is computed incorrectly. `ByteBuffer.wrap(bytes, i, ...)` returns 
a buffer whose position starts at 0, so `in.position()` after decoding is 
already the number of bytes consumed. Subtracting `i` makes `consumed` 
wrong/negative for `i>0`, which will mis-advance `i` and can massively inflate 
the failure rate (likely vetoing real CJK).



##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetector.java:
##########
@@ -274,6 +347,153 @@ public List<EncodingResult> detect(TikaInputStream tis, 
Metadata metadata,
         return List.of(new EncodingResult(champion, confidence));
     }
 
+    /** Minimum diff-z margin by which the other family must beat the 
champion's
+     *  family before the family gate overrides.  Large enough to ignore the
+     *  noise-level boundary ties; real CJK-vs-garbage diffs are far larger. */
+    private static final double FAMILY_DIFF_MARGIN = 2.0;
+
+    private static boolean isCjkCharset(String name) {
+        String n = name.toLowerCase(java.util.Locale.ROOT);
+        return n.contains("gb") || n.contains("big5") || n.contains("euc")
+                || n.contains("shift") || n.contains("jis") || 
n.contains("2022")
+                || n.contains("949");
+    }
+
+    /** Highest whole-text-z candidate within the requested family (CJK or 
not). */
+    private static Charset bestInFamily(Map<Charset, Double> wholeZ, boolean 
cjk) {
+        Charset best = null;
+        double bz = Double.NEGATIVE_INFINITY;
+        for (Map.Entry<Charset, Double> e : wholeZ.entrySet()) {
+            if (isCjkCharset(e.getKey().name()) == cjk && e.getValue() > bz) {
+                bz = e.getValue();
+                best = e.getKey();
+            }
+        }
+        return best;
+    }
+
+    /** Script-letter "diff" content: codepoints &ge; 0x80 that are letters/
+     *  ideographs — the high bytes where candidate decodes differ.  Shared 
ASCII
+     *  and non-ASCII punctuation/symbols are dropped (they dilute toward a
+     *  COMMON-dominated tie).  Used only for the CJK-vs-non-CJK family gate. 
*/
+    private static String scriptLetters(String s) {
+        StringBuilder b = new StringBuilder();
+        s.codePoints().forEach(c -> {
+            if (c >= 0x80 && Character.isLetter(c)) {
+                b.appendCodePoint(c);
+            }
+        });
+        return b.toString();
+    }
+
+    /** Canonical {@code Charset.name()} of the WHATWG-default Latin fallback. 
*/
+    private static final String WIN1252 = "windows-1252";
+
+    /** Latin single-byte charsets the within-Latin letter gate may arbitrate.
+     *  EXCLUDES non-Latin SBCS (Cyrillic windows-1251 / ISO-8859-5, Greek
+     *  -1253 / -7, Hebrew -1255 / -8, Arabic -1256 / -6, Thai) whose cased
+     *  letters would pollute the count, and all multi-byte CJK (the family
+     *  gate's territory). */
+    private static final Set<String> LATIN_SBCS = new HashSet<>(Arrays.asList(
+            "windows-1252", "windows-1250", "windows-1254", "windows-1257", 
"windows-1258",
+            "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", 
"ISO-8859-9",
+            "ISO-8859-10", "ISO-8859-13", "ISO-8859-14", "ISO-8859-15", 
"ISO-8859-16",
+            "IBM437", "IBM850", "IBM852", "IBM858", "IBM860", "IBM861", 
"IBM863", "IBM865",
+            "x-MacRoman", "x-MacCentralEurope", "x-MacRomania", 
"x-MacIceland"));
+
+    /** Probe must have at least this many high bytes for the gate to act —
+     *  below it the letter gap is noise (most over-picks are sparse). */
+    private static final int LATIN_GATE_MIN_HIGH_BYTES = 16;
+    /** windows-1252 must win the cased-letter count by &gt; max(FLOOR, 
FRACTION
+     *  * highBytes).  The margin lets the gate cover Central-European / DOS
+     *  siblings safely — genuine CE text wins MORE letters under its true
+     *  charset so the gate stays silent — without the tie-flip that forces the
+     *  mojibuster Western-Latin fallback to scope itself out of those 
families. */
+    private static final double LATIN_GATE_MARGIN_FLOOR = 6.0;
+    private static final double LATIN_GATE_MARGIN_FRACTION = 0.20;
+
+    /**
+     * Within-Latin letter-plausibility gate (demote-only).  Demotes {@code
+     * champion} to windows-1252 only when windows-1252 is a live candidate, 
both
+     * are Latin SBCS, the probe is high-byte-dense, and windows-1252 decodes
+     * clearly MORE cased high-byte letters than the champion — the box-drawing
+     * signature, where a wrong IBM850 / x-MacRoman decode maps high bytes to
+     * symbols.  The compare is directional: a genuine Central-European / DOS 
doc
+     * wins MORE letters under its true charset, so the gate leaves it 
untouched.
+     * Latin-scoped so it never crosses the CJK boundary (the family gate 
above)
+     * or touches non-Latin SBCS.  Returns the (possibly demoted) charset.
+     */
+    static Charset applyLatinLetterGate(byte[] probe, Charset champion,
+                                        Set<Charset> candidates) {
+        String name = champion.name();
+        if (WIN1252.equals(name) || !LATIN_SBCS.contains(name)) {
+            return champion;
+        }
+        Charset win = null;
+        for (Charset c : candidates) {
+            if (WIN1252.equals(c.name())) {
+                win = c;
+                break;
+            }
+        }
+        if (win == null) {
+            return champion;
+        }
+        int high = HighByteLetterStats.countHighBytes(probe);
+        if (high < LATIN_GATE_MIN_HIGH_BYTES) {
+            return champion;
+        }
+        int winLetters = HighByteLetterStats.countCasedHighByteLetters(probe, 
win);
+        int champLetters = 
HighByteLetterStats.countCasedHighByteLetters(probe, champion);
+        double margin = Math.max(LATIN_GATE_MARGIN_FLOOR, 
LATIN_GATE_MARGIN_FRACTION * high);
+        if (winLetters > champLetters + margin) {
+            LOG.trace("junk-filter latin gate: {} -> windows-1252 (cased 
high-byte "
+                    + "letters {} vs {}, high={})", name, champLetters, 
winLetters, high);
+            return win;
+        }
+        return champion;
+    }
+
+    /**
+     * Restrict the candidate set the tournament will decode+clean+score: keep
+     * every DECLARATIVE/STRUCTURAL anchor (author intent / byte-grammar 
proof),
+     * plus the top {@link #ALWAYS_KEEP_TOP_N} STATISTICAL candidates by
+     * confidence, plus any deeper STATISTICAL candidate still within
+     * {@link #MIN_TAIL_CONFIDENCE}.  Drops the dominated low-confidence tail —
+     * the speed lever — without removing any anchor or NB's real contenders.

Review Comment:
   The Javadoc says tail candidates are kept if they are still "within 
MIN_TAIL_CONFIDENCE of the top", but the implementation actually keeps any 
STATISTICAL candidate with `confidence >= MIN_TAIL_CONFIDENCE` (absolute 
threshold) in addition to the first N. Please update the comment to match the 
code to avoid confusion (especially since `MIN_TAIL_CONFIDENCE` reads like a 
relative band).



##########
tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetectorTest.java:
##########
@@ -67,6 +67,79 @@ public TextQualityComparison compare(String labelA, String 
candidateA,
         }
     }
 
+    /**
+     * Functional stub for the CJK-vs-non-CJK family gate.  Returns one of four
+     * controlled z-scores per scored string, keyed on whether the string
+     * contains Han ideographs (CJK family) and whether it is a "diff" string
+     * (script-letters only, i.e. every codepoint &ge; 0x80) vs whole text.
+     * Lets us drive {@code JunkFilterEncodingDetector}'s gate 
deterministically
+     * without the real model: the detector scores both the whole decoded text
+     * (champion metric) and its script-letter diff (family-gate metric) for
+     * each candidate, so the four cells fully determine the gate's decision.
+     */
+    private static final class ZStub implements TextQualityDetector {
+        private final double wholeCjk;
+        private final double wholeNonCjk;
+        private final double diffCjk;
+        private final double diffNonCjk;
+
+        ZStub(double wholeCjk, double wholeNonCjk, double diffCjk, double 
diffNonCjk) {
+            this.wholeCjk = wholeCjk;
+            this.wholeNonCjk = wholeNonCjk;
+            this.diffCjk = diffCjk;
+            this.diffNonCjk = diffNonCjk;
+        }
+
+        private static boolean isCjk(String s) {
+            return s.codePoints().anyMatch(c -> c >= 0x4E00 && c <= 0x9FFF);
+        }
+
+        /** Diff string = script-letters only: non-empty, every codepoint &ge; 
0x80. */
+        private static boolean isDiff(String s) {
+            return !s.isEmpty() && s.codePoints().allMatch(c -> c >= 0x80);
+        }
+
+        @Override
+        public TextQualityScore score(String text) {
+            boolean cjk = isCjk(text);
+            double z = isDiff(text)
+                    ? (cjk ? diffCjk : diffNonCjk)
+                    : (cjk ? wholeCjk : wholeNonCjk);
+            return new TextQualityScore((float) z, Float.NaN, Float.NaN, 
Float.NaN,
+                    cjk ? "HAN" : "LATIN");
+        }
+
+        @Override
+        public TextQualityComparison compare(String labelA, String candidateA,
+                                             String labelB, String candidateB) 
{
+            // Not exercised by the gate path (which uses score()); provided 
only
+            // to satisfy the interface.
+            return new TextQualityComparison("A", 0.0f,
+                    score(candidateA), score(candidateB), labelA, labelB);
+        }

Review Comment:
   `TextQualityComparison.winner()` is documented to return the winning *label* 
(i.e., `labelA` or `labelB`), but this stub returns the literal string "A". 
Even though this path isn't currently exercised, it violates the interface 
contract and can cause confusing failures if the test ends up using `compare()` 
in the future.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] TIKA-4745 - charset/junk/tika-eval improvements [tika]

Reply via email to