[
https://issues.apache.org/jira/browse/TIKA-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085974#comment-18085974
]
ASF GitHub Bot commented on TIKA-4745:
--------------------------------------
Copilot commented on code in PR #2861:
URL: https://github.com/apache/tika/pull/2861#discussion_r3353345071
##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetector.java:
##########
@@ -274,6 +348,154 @@ public List<EncodingResult> detect(TikaInputStream tis, Metadata metadata,
return List.of(new EncodingResult(champion, confidence));
}
+ /** Minimum diff-z margin by which the other family must beat the champion's
+ * family before the family gate overrides. Large enough to ignore the
+ * noise-level boundary ties; real CJK-vs-garbage diffs are far larger. */
+ private static final double FAMILY_DIFF_MARGIN = 2.0;
+
+ private static boolean isCjkCharset(String name) {
+ String n = name.toLowerCase(java.util.Locale.ROOT);
+ return n.contains("gb") || n.contains("big5") || n.contains("euc")
+ || n.contains("shift") || n.contains("jis") || n.contains("2022")
+ || n.contains("949");
+ }
+
+ /** Highest whole-text-z candidate within the requested family (CJK or not). */
+ private static Charset bestInFamily(Map<Charset, Double> wholeZ, boolean cjk) {
+ Charset best = null;
+ double bz = Double.NEGATIVE_INFINITY;
+ for (Map.Entry<Charset, Double> e : wholeZ.entrySet()) {
+ if (isCjkCharset(e.getKey().name()) == cjk && e.getValue() > bz) {
+ bz = e.getValue();
+ best = e.getKey();
+ }
+ }
+ return best;
+ }
+
+ /** Script-letter "diff" content: codepoints ≥ 0x80 that are letters/
+ * ideographs — the high bytes where candidate decodes differ. Shared ASCII
+ * and non-ASCII punctuation/symbols are dropped (they dilute toward a
+ * COMMON-dominated tie). Used only for the CJK-vs-non-CJK family gate. */
+ private static String scriptLetters(String s) {
+ StringBuilder b = new StringBuilder();
+ s.codePoints().forEach(c -> {
+ if (c >= 0x80 && Character.isLetter(c)) {
+ b.appendCodePoint(c);
+ }
+ });
+ return b.toString();
+ }
+
+ /** Canonical {@code Charset.name()} of the WHATWG-default Latin fallback. */
+ private static final String WIN1252 = "windows-1252";
+
+ /** Latin single-byte charsets the within-Latin letter gate may arbitrate.
+ * EXCLUDES non-Latin SBCS (Cyrillic windows-1251 / ISO-8859-5, Greek
+ * -1253 / -7, Hebrew -1255 / -8, Arabic -1256 / -6, Thai) whose cased
+ * letters would pollute the count, and all multi-byte CJK (the family
+ * gate's territory). */
+ private static final Set<String> LATIN_SBCS = new HashSet<>(Arrays.asList(
+ "windows-1252", "windows-1250", "windows-1254", "windows-1257", "windows-1258",
+ "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-9",
+ "ISO-8859-10", "ISO-8859-13", "ISO-8859-14", "ISO-8859-15", "ISO-8859-16",
+ "IBM437", "IBM850", "IBM852", "IBM858", "IBM860", "IBM861", "IBM863", "IBM865",
+ "x-MacRoman", "x-MacCentralEurope", "x-MacRomania", "x-MacIceland"));
+
+ /** Probe must have at least this many high bytes for the gate to act —
+ * below it the letter gap is noise (most over-picks are sparse). */
+ private static final int LATIN_GATE_MIN_HIGH_BYTES = 16;
+ /** windows-1252 must win the cased-letter count by > max(FLOOR, FRACTION
+ * * highBytes). The margin lets the gate cover Central-European / DOS
+ * siblings safely — genuine CE text wins MORE letters under its true
+ * charset so the gate stays silent — without the tie-flip that forces the
+ * mojibuster Western-Latin fallback to scope itself out of those families. */
+ private static final double LATIN_GATE_MARGIN_FLOOR = 6.0;
+ private static final double LATIN_GATE_MARGIN_FRACTION = 0.20;
+
+ /**
+ * Within-Latin letter-plausibility gate (demote-only). Demotes {@code
+ * champion} to windows-1252 only when windows-1252 is a live candidate, both
+ * are Latin SBCS, the probe is high-byte-dense, and windows-1252 decodes
+ * clearly MORE cased high-byte letters than the champion — the box-drawing
+ * signature, where a wrong IBM850 / x-MacRoman decode maps high bytes to
+ * symbols. The compare is directional: a genuine Central-European / DOS doc
+ * wins MORE letters under its true charset, so the gate leaves it untouched.
+ * Latin-scoped so it never crosses the CJK boundary (the family gate above)
+ * or touches non-Latin SBCS. Returns the (possibly demoted) charset.
+ */
+ static Charset applyLatinLetterGate(byte[] probe, Charset champion,
+ Set<Charset> candidates) {
+ String name = champion.name();
+ if (WIN1252.equals(name) || !LATIN_SBCS.contains(name)) {
+ return champion;
+ }
+ Charset win = null;
+ for (Charset c : candidates) {
+ if (WIN1252.equals(c.name())) {
+ win = c;
+ break;
+ }
+ }
+ if (win == null) {
+ return champion;
+ }
+ int high = HighByteLetterStats.countHighBytes(probe);
+ if (high < LATIN_GATE_MIN_HIGH_BYTES) {
+ return champion;
+ }
+ int winLetters = HighByteLetterStats.countCasedHighByteLetters(probe, win);
+ int champLetters = HighByteLetterStats.countCasedHighByteLetters(probe, champion);
+ double margin = Math.max(LATIN_GATE_MARGIN_FLOOR, LATIN_GATE_MARGIN_FRACTION * high);
+ if (winLetters > champLetters + margin) {
+ LOG.trace("junk-filter latin gate: {} -> windows-1252 (cased high-byte "
+ + "letters {} vs {}, high={})", name, champLetters, winLetters, high);
Review Comment:
The trace log arguments are in a misleading order: the message describes a
demotion to windows-1252 and reads “letters … vs …”, but the call passes
`champLetters` first and `winLetters` second. The logged counts therefore do
not follow the order the message implies, which makes the gate harder to
troubleshoot.
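
One way to remove the ambiguity, sketched here as a standalone class rather than as a patch to `JunkFilterEncodingDetector`: label each count with the charset it belongs to, so the argument order can no longer be misread. The class name `LatinGateTraceSketch` and the helper `traceMessage` are hypothetical, and `String.format` stands in for the SLF4J-style `LOG.trace` call. Only the two margin constants and the margin rule are taken from the diff.

```java
import java.util.Locale;

public class LatinGateTraceSketch {
    // Constants copied from the diff under review.
    static final double LATIN_GATE_MARGIN_FLOOR = 6.0;
    static final double LATIN_GATE_MARGIN_FRACTION = 0.20;

    /** Margin rule from the diff: max(FLOOR, FRACTION * highBytes). */
    static double margin(int highBytes) {
        return Math.max(LATIN_GATE_MARGIN_FLOOR, LATIN_GATE_MARGIN_FRACTION * highBytes);
    }

    /**
     * Hypothetical helper: builds the trace text with every count labelled by
     * the charset that produced it, windows-1252 (the demotion target) first.
     */
    static String traceMessage(String championName, int winLetters,
                               int champLetters, int high) {
        return String.format(Locale.ROOT,
                "junk-filter latin gate: %s -> windows-1252 (cased high-byte letters: "
                        + "windows-1252=%d, %s=%d, high=%d)",
                championName, winLetters, championName, champLetters, high);
    }

    public static void main(String[] args) {
        // Illustrative numbers only, not taken from any real probe.
        int high = 64;
        int winLetters = 40;
        int champLetters = 12;
        // Same directional comparison as the gate in the diff.
        if (winLetters > champLetters + margin(high)) {
            System.out.println(traceMessage("IBM850", winLetters, champLetters, high));
        }
    }
}
```

In the actual detector the smaller change would be to swap the two arguments in the existing `LOG.trace` call so they read `winLetters, champLetters`. The labelled form is an alternative that stays unambiguous even if the message is edited again later.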
> Small improvements to lang detection, charset detection and junk detection
> --------------------------------------------------------------------------
>
> Key: TIKA-4745
> URL: https://issues.apache.org/jira/browse/TIKA-4745
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> I ran a regression test in prep for the 4.0.0-beta-1 release. There are a
> number of smallish things that we can clean up in the components listed in
> the title.