[ 
https://issues.apache.org/jira/browse/TIKA-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085974#comment-18085974
 ] 

ASF GitHub Bot commented on TIKA-4745:
--------------------------------------

Copilot commented on code in PR #2861:
URL: https://github.com/apache/tika/pull/2861#discussion_r3353345071


##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetector.java:
##########
@@ -274,6 +348,154 @@ public List<EncodingResult> detect(TikaInputStream tis, Metadata metadata,
         return List.of(new EncodingResult(champion, confidence));
     }
 
+    /** Minimum diff-z margin by which the other family must beat the champion's
+     *  family before the family gate overrides.  Large enough to ignore the
+     *  noise-level boundary ties; real CJK-vs-garbage diffs are far larger. */
+    private static final double FAMILY_DIFF_MARGIN = 2.0;
+
+    private static boolean isCjkCharset(String name) {
+        String n = name.toLowerCase(java.util.Locale.ROOT);
+        return n.contains("gb") || n.contains("big5") || n.contains("euc")
+                || n.contains("shift") || n.contains("jis") || n.contains("2022")
+                || n.contains("949");
+    }
+
+    /** Highest whole-text-z candidate within the requested family (CJK or not). */
+    private static Charset bestInFamily(Map<Charset, Double> wholeZ, boolean cjk) {
+        Charset best = null;
+        double bz = Double.NEGATIVE_INFINITY;
+        for (Map.Entry<Charset, Double> e : wholeZ.entrySet()) {
+            if (isCjkCharset(e.getKey().name()) == cjk && e.getValue() > bz) {
+                bz = e.getValue();
+                best = e.getKey();
+            }
+        }
+        return best;
+    }
+
+    /** Script-letter "diff" content: codepoints &ge; 0x80 that are letters/
+     *  ideographs — the high bytes where candidate decodes differ.  Shared ASCII
+     *  and non-ASCII punctuation/symbols are dropped (they dilute toward a
+     *  COMMON-dominated tie).  Used only for the CJK-vs-non-CJK family gate. */
+    private static String scriptLetters(String s) {
+        StringBuilder b = new StringBuilder();
+        s.codePoints().forEach(c -> {
+            if (c >= 0x80 && Character.isLetter(c)) {
+                b.appendCodePoint(c);
+            }
+        });
+        return b.toString();
+    }
+
+    /** Canonical {@code Charset.name()} of the WHATWG-default Latin fallback. */
+    private static final String WIN1252 = "windows-1252";
+
+    /** Latin single-byte charsets the within-Latin letter gate may arbitrate.
+     *  EXCLUDES non-Latin SBCS (Cyrillic windows-1251 / ISO-8859-5, Greek
+     *  -1253 / -7, Hebrew -1255 / -8, Arabic -1256 / -6, Thai) whose cased
+     *  letters would pollute the count, and all multi-byte CJK (the family
+     *  gate's territory). */
+    private static final Set<String> LATIN_SBCS = new HashSet<>(Arrays.asList(
+            "windows-1252", "windows-1250", "windows-1254", "windows-1257", "windows-1258",
+            "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-9",
+            "ISO-8859-10", "ISO-8859-13", "ISO-8859-14", "ISO-8859-15", "ISO-8859-16",
+            "IBM437", "IBM850", "IBM852", "IBM858", "IBM860", "IBM861", "IBM863", "IBM865",
+            "x-MacRoman", "x-MacCentralEurope", "x-MacRomania", "x-MacIceland"));
+
+    /** Probe must have at least this many high bytes for the gate to act —
+     *  below it the letter gap is noise (most over-picks are sparse). */
+    private static final int LATIN_GATE_MIN_HIGH_BYTES = 16;
+    /** windows-1252 must win the cased-letter count by &gt; max(FLOOR, FRACTION
+     *  * highBytes).  The margin lets the gate cover Central-European / DOS
+     *  siblings safely — genuine CE text wins MORE letters under its true
+     *  charset so the gate stays silent — without the tie-flip that forces the
+     *  mojibuster Western-Latin fallback to scope itself out of those families. */
+    private static final double LATIN_GATE_MARGIN_FLOOR = 6.0;
+    private static final double LATIN_GATE_MARGIN_FRACTION = 0.20;
+
+    /**
+     * Within-Latin letter-plausibility gate (demote-only).  Demotes {@code
+     * champion} to windows-1252 only when windows-1252 is a live candidate, both
+     * are Latin SBCS, the probe is high-byte-dense, and windows-1252 decodes
+     * clearly MORE cased high-byte letters than the champion — the box-drawing
+     * signature, where a wrong IBM850 / x-MacRoman decode maps high bytes to
+     * symbols.  The compare is directional: a genuine Central-European / DOS doc
+     * wins MORE letters under its true charset, so the gate leaves it untouched.
+     * Latin-scoped so it never crosses the CJK boundary (the family gate above)
+     * or touches non-Latin SBCS.  Returns the (possibly demoted) charset.
+     */
+    static Charset applyLatinLetterGate(byte[] probe, Charset champion,
+                                        Set<Charset> candidates) {
+        String name = champion.name();
+        if (WIN1252.equals(name) || !LATIN_SBCS.contains(name)) {
+            return champion;
+        }
+        Charset win = null;
+        for (Charset c : candidates) {
+            if (WIN1252.equals(c.name())) {
+                win = c;
+                break;
+            }
+        }
+        if (win == null) {
+            return champion;
+        }
+        int high = HighByteLetterStats.countHighBytes(probe);
+        if (high < LATIN_GATE_MIN_HIGH_BYTES) {
+            return champion;
+        }
+        int winLetters = HighByteLetterStats.countCasedHighByteLetters(probe, win);
+        int champLetters = HighByteLetterStats.countCasedHighByteLetters(probe, champion);
+        double margin = Math.max(LATIN_GATE_MARGIN_FLOOR, LATIN_GATE_MARGIN_FRACTION * high);
+        if (winLetters > champLetters + margin) {
+            LOG.trace("junk-filter latin gate: {} -> windows-1252 (cased high-byte "
+                    + "letters {} vs {}, high={})", name, champLetters, winLetters, high);

Review Comment:
   The trace log arguments are swapped/misleading: the message says “letters … vs …” in the context of demoting to windows-1252, but it currently logs `champLetters` first and `winLetters` second. The logged counts therefore don’t match the order the message implies, which makes the gate harder to troubleshoot.
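
   One way to remove the ambiguity, sketched below under assumptions, is to label each count with the charset it belongs to so that argument order no longer matters to the reader. The class `LatinGateTraceSketch` and the helper `formatGateTrace` are hypothetical names used only for illustration; they are not part of the PR, and the sketch uses `String.format` rather than the logger so that it runs on its own.

```java
import java.util.Locale;

/** Illustrative sketch only; not code from the PR. */
public final class LatinGateTraceSketch {

    private static final String WIN1252 = "windows-1252";

    /**
     * Hypothetical helper: builds the trace text with every count labeled
     * by the charset that produced it, so the order cannot be misread.
     */
    static String formatGateTrace(String championName, int champLetters,
                                  int winLetters, int high) {
        return String.format(Locale.ROOT,
                "junk-filter latin gate: %s -> %s "
                        + "(cased high-byte letters: %s=%d vs %s=%d, high=%d)",
                championName, WIN1252,
                WIN1252, winLetters,
                championName, champLetters,
                high);
    }

    public static void main(String[] args) {
        // Example values chosen so the gate would fire:
        // margin = max(6.0, 0.20 * 50) = 10.0 and 40 > 12 + 10.
        System.out.println(formatGateTrace("IBM850", 12, 40, 50));
    }
}
```

   With the example values this prints `junk-filter latin gate: IBM850 -> windows-1252 (cased high-byte letters: windows-1252=40 vs IBM850=12, high=50)`. In the actual method the same labeling could presumably be done directly in the `LOG.trace` format string; simply swapping the two count arguments would be the smaller change.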





> Small improvements to lang detection, charset detection and junk detection
> --------------------------------------------------------------------------
>
>                 Key: TIKA-4745
>                 URL: https://issues.apache.org/jira/browse/TIKA-4745
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> I ran a regression test in prep for the 4.0.0-beta-1 release. There are a 
> number of smallish things that we can clean up in the components listed in 
> the title.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
