This is an automated email from the ASF dual-hosted git repository.
tballison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git
The following commit(s) were added to refs/heads/main by this push:
new 4ce5c7011b TIKA-4745 - charset/junk/tika-eval improvements (#2861)
4ce5c7011b is described below
commit 4ce5c7011bba18889826d32799720768b5e6d42f
Author: Tim Allison <[email protected]>
AuthorDate: Thu Jun 4 05:52:39 2026 -0400
TIKA-4745 - charset/junk/tika-eval improvements (#2861)
---
.skills/tika-eval-compare.md | 10 +
.skills/tika-eval-encoding-regression.md | 35 +++
.skills/tika-eval-h2-query.md | 51 ++++
.../pages/advanced/charset-detection-design.adoc | 90 ++++++-
.../integration-testing/tika-eval-regression.adoc | 21 +-
.../org/apache/tika/detect/CharsetSupersets.java | 2 +
.../apache/tika/detect/HighByteLetterStats.java | 94 +++++++
.../apache/tika/detect/CharsetSupersetsTest.java | 67 +++++
.../tika/detect/HighByteLetterStatsTest.java | 72 ++++++
.../tika/ml/chardetect/CjkDecodeValidator.java | 151 +++++++++++
.../tika/ml/chardetect/CosineFamilyArbiter.java | 241 ++++++++++++++++++
.../ml/chardetect/MojibusterEncodingDetector.java | 194 ++++++++------
.../NaiveBayesBigramEncodingDetector.java | 21 +-
.../apache/tika/ml/chardetect/cosine-profiles.bin | Bin 0 -> 1080313 bytes
.../org/apache/tika/ml/chardetect/nb-bigram.bin | Bin 1016638 -> 1008871 bytes
.../tika/ml/chardetect/CjkDecodeValidatorTest.java | 81 ++++++
.../tika/ml/chardetect/Iso2022DetectionTest.java | 83 ++++++
.../org/apache/tika/eval/app/ExtractComparer.java | 5 +
.../tika/eval/app/ExtractComparerRunner.java | 2 +
.../apache/tika/eval/app/ExtractProfileRunner.java | 1 +
.../org/apache/tika/eval/app/ExtractProfiler.java | 9 +
.../org/apache/tika/eval/app/ProfilerBase.java | 46 ++++
.../java/org/apache/tika/eval/app/db/Cols.java | 6 +-
.../eval/app/reports/MarkdownSummaryWriter.java | 8 +-
.../tika/eval/core/langid/LanguageIDWrapper.java | 16 +-
.../eval/core/textstats/NonAsciiCharCounter.java | 39 +++
.../core/textstats/ReplacementCharCounter.java | 39 +++
.../ml/junkdetect/JunkFilterEncodingDetector.java | 283 +++++++++++++++++++--
.../junkdetect/JunkFilterEncodingDetectorTest.java | 183 +++++++++++++
.../tika/ml/junkdetect/LatinLetterGateTest.java | 110 ++++++++
30 files changed, 1842 insertions(+), 118 deletions(-)
diff --git a/.skills/tika-eval-compare.md b/.skills/tika-eval-compare.md
index 4fc628cc06..d4549d26c8 100644
--- a/.skills/tika-eval-compare.md
+++ b/.skills/tika-eval-compare.md
@@ -120,6 +120,16 @@ directory, plus a `summary.md` with key metrics:
| Exception count | ≤ A | > A |
| Total files (B) vs (A) | equal or higher | lower — missing embedded docs |
+### Encoding-detection evals
+
+For charset/encoding-detector changes, the summary reports don't cover it — query
+the db directly (see the **tika-eval-h2-query** skill). The detected encoding is in
+the `ENCODINGS_A`/`ENCODINGS_B` tables (`DETECTED_ENCODING`, `ENCODING_DETECTOR`,
+`DECLARED_METADATA`), **not** `PROFILES`. Key signals: per-encoding counts (e.g. CJK
+total), A→B flips by direction, and OOV on the flipped files (a flip that *worsens*
+OOV is a regression; one that *improves* it is a fix). Pair on `ID`; map back to the
+source file via `PROFILES_*.FILE_NAME` (the content hash).
+
### CRITICAL: Review Checklist
The purpose of tika-eval is to find regressions BEFORE a release. After
diff --git a/.skills/tika-eval-encoding-regression.md b/.skills/tika-eval-encoding-regression.md
index 1d3e61a67c..7532df3981 100644
--- a/.skills/tika-eval-encoding-regression.md
+++ b/.skills/tika-eval-encoding-regression.md
@@ -123,6 +123,41 @@ WHERE <enc_a/enc_b filter as above>
ORDER BY delta ASC LIMIT 15;
```
+## Reading the signals — OOV, languageness, and FFFD together
+
+No single signal is authoritative. Use `oov` as a **secondary** signal alongside
+`languageness` (the junk-model coherence z-score) and the U+FFFD rate — each is
+right where the others are blind, so cross-check rather than ranking on any one.
+(Established 2026-06-03: a 40-file OOV-"worse" set was mostly metric artifacts
+once languageness/FFFD were brought in — only ~6 were real. But OOV is also the
+*correct* signal where languageness is blind, so neither dominates.)
+
+- **OOV can mislead** when langid shifts — a CJK/UTF-8 recovery in B is scored
+ against a different vocab → higher OOV though B is right — or when a wrong
+ decode fragments words into more short common tokens (→ higher count for the
+ WORSE decode). A common-token delta is a signal, not proof.
+- **languageness can mislead** on SBCS↔SBCS cross-script mojibake — Greek decoded
+ as KOI8-R is "coherent" Cyrillic, so `languageness` stays flat while `oov`
+ correctly flags it. Conversely languageness catches OOV's CJK/script-recovery
+ blind spot. Each covers the other's blind spot.
+- **FFFD rate** flags decode failures (illegal bytes): `num_replacement /
+ num_non_ascii` (un-diluted; `/ content_length` dilutes to ~0 on ASCII-heavy
+ docs). Tika strips C0 controls at extraction, so legal-but-wrong (C1) mojibake
+ does not surface here — that signal belongs in the detector chain, not the eval.
+- **In practice:** when the signals agree, high confidence; when they disagree
+ (OOV-worse but languageness-better, or vice versa), that file needs a look —
+ the disagreement points you at WHICH files to inspect, it does not by itself
+ declare OOV or languageness "wrong." Split OOV-worse by languageness direction
+ (query in `tika-eval-regression.adoc`).
+
+### Isolate a change against the PRIOR run, not just 3.x
+
+To see what one chain change actually did, compare the new run against the
+*previous* 4.x run (B-new vs B-prior), not only vs 3.x. The diff should be
+*surgical* — e.g. the within-Latin letter gate moved exactly 6 files
+(IBM850 / x-MacRoman → windows-1252) vs the prior run and nothing else. A
+bigger-than-expected diff means the change fired more broadly than intended.
+
## Per-file detector attribution (`X-TIKA:encodingDetectionTrace`)
Every JSON extract from a chain with multiple detectors carries
diff --git a/.skills/tika-eval-h2-query.md b/.skills/tika-eval-h2-query.md
index d4f0b2c378..37532d64c0 100644
--- a/.skills/tika-eval-h2-query.md
+++ b/.skills/tika-eval-h2-query.md
@@ -40,6 +40,7 @@ it then waits on stdin and appears to hang).
|---|---|
| `PROFILES_A` / `PROFILES_B` | one row per extracted file: `FILE_NAME`, `MD5`, `MIME_ID`, `CONTAINER_ID`, `EMBEDDED_FILE_PATH`, `LENGTH`, `NUM_PAGES`, … (A = "before"/-a, B = "after"/-b) |
| `CONTENTS_A` / `CONTENTS_B` | text profile per file (join on `ID`): `OOV`, `LANGUAGENESS`, `NUM_TOKENS`, `NUM_COMMON_TOKENS`, `LANG_ID_1`/`LANG_ID_PROB_1`, `TOKEN_ENTROPY_RATE`, … |
+| `ENCODINGS_A` / `ENCODINGS_B` | detected-encoding per file (join on `ID`): `DETECTED_ENCODING`, `ENCODING_DETECTOR`, `DECLARED_METADATA`. **`DETECTED_ENCODING` lives HERE, not on `PROFILES` — moved out in the encodings-table refactor; querying `PROFILES_*.DETECTED_ENCODING` now errors "Column not found".** A file with no detected encoding has no row. |
| `CONTENT_COMPARISONS` | per-file A↔B comparison (`ID`): `DICE_COEFFICIENT`, `OVERLAP`, top token diffs |
| `MIMES` | `MIME_ID` → `MIME_STRING` |
| `CONTAINERS` | container id → input file path |
@@ -86,6 +87,56 @@ FROM CONTENTS_A ca JOIN CONTENTS_B cb ON ca.ID = cb.ID;
To bring in mime/path, join `PROFILES_A pa ON pa.ID = ca.ID` (and `pb`/`cc`
likewise on the same `id`) — all on `id`.
+Detected-encoding queries — `DETECTED_ENCODING` is on `ENCODINGS_A`/`ENCODINGS_B`
+(join on `ID`), NOT `PROFILES`. CJK count in B (LOWER() — `REGEXP` is case-sensitive,
+see below):
+
+```sql
+SELECT COUNT(*) FROM ENCODINGS_B
+WHERE LOWER(DETECTED_ENCODING) REGEXP 'gb|big5|euc|shift|jis|2022|949';
+```
+
+Encoding flips A→B by direction (what changed between runs):
+
+```sql
+SELECT ea.DETECTED_ENCODING a_enc, eb.DETECTED_ENCODING b_enc, COUNT(*) n
+FROM ENCODINGS_A ea JOIN ENCODINGS_B eb ON ea.ID = eb.ID
+WHERE ea.DETECTED_ENCODING <> eb.DETECTED_ENCODING
+GROUP BY a_enc, b_enc ORDER BY n DESC;
+```
+
+Map a flipped file back to its source file — `PROFILES_*.FILE_NAME` is the content
+hash (the input file is `<corpus>/<first-2-hex>/<FILE_NAME>`); join `CONTENTS` for OOV:
+
+```sql
+SELECT pb.FILE_NAME, ea.DETECTED_ENCODING a_enc, eb.DETECTED_ENCODING b_enc,
+ ca.OOV oov_a, cb.OOV oov_b
+FROM ENCODINGS_A ea JOIN ENCODINGS_B eb ON ea.ID = eb.ID
+ JOIN PROFILES_B pb ON ea.ID = pb.ID
+ JOIN CONTENTS_A ca ON ea.ID = ca.ID JOIN CONTENTS_B cb ON ea.ID = cb.ID
+WHERE LOWER(eb.DETECTED_ENCODING) REGEXP 'gb|big5|euc|shift|jis|2022|949'
+ AND NOT (LOWER(ea.DETECTED_ENCODING) REGEXP 'gb|big5|euc|shift|jis|2022|949');
+```
+
+## Gotcha: `REGEXP` is case-sensitive (silent wrong results)
+
+H2's `REGEXP` operator is **case-sensitive**, so `DETECTED_ENCODING REGEXP
+'big5|gb|euc'` does **not** match `Big5-HKSCS` or `GB18030` — and it fails
+*silently*, quietly dropping/keeping the wrong rows instead of erroring. Always
+either lowercase the column or use the inline case-insensitive flag:
+
+```sql
+-- right:
+WHERE LOWER(DETECTED_ENCODING) REGEXP 'big5|gb|euc|shift|jis|2022|949'
+-- or:
+WHERE DETECTED_ENCODING REGEXP '(?i)big5|gb|euc|shift|jis|2022|949'
+-- wrong (misses Big5-HKSCS, GB18030, Shift_JIS, ...):
+WHERE DETECTED_ENCODING REGEXP 'big5|gb|euc|shift|jis|2022|949'
+```
+
+(`DETECTED_ENCODING` is on `ENCODINGS_A`/`ENCODINGS_B` — join to `PROFILES`/`CONTENTS`
+on `ID` — populated from `X-TIKA:detectedEncoding`.)
+
## Tip
For a quick interactive session, drop `-sql` and you get an H2 prompt; `SHOW
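
The `REGEXP` case-sensitivity gotcha in the skill file above can be reproduced outside the database. The sketch below assumes that H2's `REGEXP` operator behaves like `java.util.regex` with `find()` semantics, as its documentation describes; the class and method names are hypothetical and are not part of Tika or H2.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class RegexpCaseSketch {

    /** find()-style match: true if the pattern occurs anywhere in the value. */
    static boolean matches(String regex, String value) {
        return Pattern.compile(regex).matcher(value).find();
    }

    public static void main(String[] args) {
        String cjk = "big5|gb|euc|shift|jis|2022|949";
        String[] encodings = {"Big5-HKSCS", "GB18030", "Shift_JIS", "windows-1252"};
        for (String enc : encodings) {
            // plain: case-sensitive pattern against the raw value
            // lowered: same pattern against the lowercased value (the LOWER() fix)
            // inlineFlag: (?i) prefix makes every alternative case-insensitive
            System.out.println(enc
                    + " plain=" + matches(cjk, enc)
                    + " lowered=" + matches(cjk, enc.toLowerCase(Locale.ROOT))
                    + " inlineFlag=" + matches("(?i)" + cjk, enc));
        }
    }
}
```

Both fixes give the same answer here; lowercasing the column has the small advantage of not depending on inline-flag support in whatever engine evaluates the pattern.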
diff --git a/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc b/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
index fc870e5d51..815ed375e4 100644
--- a/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
+++ b/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
@@ -53,7 +53,7 @@ results are collected into an `EncodingDetectorContext` on the
| `MojibusterEncodingDetector`
| `tika-encoding-detector-mojibuster`
| Structural UTF-32 and UTF-16 detection, UTF-8 grammar gate, HTML
- attribute-aware stripping, then a 33-class byte-bigram NB
+ attribute-aware stripping, then a 34-class byte-bigram NB
classifier. STRUCTURAL for structural hits; STATISTICAL for NB
predictions. See <<nb-pipeline>>.
@@ -135,6 +135,21 @@ sequences. Three outcomes:
* `AMBIGUOUS` — no complete multi-byte sequence (pure ASCII, or only
a truncated lead at probe-end). No emission.
+=== ISO-2022-JP/KR/CN structural detection (pure-ASCII branch)
+
+ISO-2022 encodings are 7-bit and escape-based (`ESC $ B`, `ESC $ ) C`, …),
+so they carry no high bytes and are invisible to the byte-bigram
+classifier; without a structural check a real ISO-2022-JP page would fall
+through to the windows-1252 default and decode to gibberish. On a
+pure-ASCII probe — the only place ISO-2022 can occur — the pipeline scans
+for the ISO-2022 designation escape and, if found, *verifies* by decoding:
+the result must contain real CJK at a near-zero replacement rate. The
+verify rejects a stray `ESC $` in ordinary ASCII (which yields no CJK).
+On success an ISO-2022-JP/KR/CN STRUCTURAL candidate is emitted. High-byte
+binary that happens to contain an escape sequence never reaches this
+check — it fails the pure-ASCII gate and takes the normal NB path, so it
+cannot trigger a false ISO-2022 detection.
+
=== Layer 4 — HTML stripping (content-type aware)
When the probe looks like HTML/XML (explicit content-type or unknown),
@@ -155,10 +170,10 @@ just content bytes for NB feature extraction.
Optimizations:
=== Layer 5 — Naive Bayes byte-bigram classifier
-33 classes: CJK multibyte (Big5-HKSCS, EUC-JP, GB18030, Shift_JIS,
+34 classes: CJK multibyte (Big5-HKSCS, EUC-JP, GB18030, Shift_JIS,
x-EUC-TW, x-windows-949), EBCDIC family (IBM420/424-ltr/rtl, IBM500,
IBM1047), DOS OEM (IBM850/852/855/866), Cyrillic (KOI8-R, KOI8-U),
-Windows single-byte (1250-1258, 874), ISO-8859-3/16, Mac (x-MacRoman,
+Windows single-byte (1250-1258, 874), ISO-8859-2/3/16, Mac (x-MacRoman,
x-mac-cyrillic), and UTF-8.
Features are **stride-1 byte bigrams** — for probe bytes `b[0..N]`,
@@ -212,9 +227,10 @@ every probe length we've measured.
* **Empty / near-empty probes (< 2 bytes)** → windows-1252 @ 0.1
confidence. WHATWG default; never returns empty result.
-* **Pure ASCII probes** (no bytes ≥ 0x80, no nulls) → windows-1252.
- Bigram NB cannot discriminate Latin code pages on pure-ASCII
- content; return the HTML5-canonical answer directly.
+* **Pure ASCII probes** (no bytes ≥ 0x80, no nulls) → ISO-2022 structural
+ detection first (see above); otherwise windows-1252. Bigram NB cannot
+ discriminate Latin code pages on pure-ASCII content; return the
+ HTML5-canonical answer directly.
* **Latin-sibling → windows-1252 rewrite** — on low-evidence probes
(< 5 high bytes), if the top NB candidate is a non-1252 member of
the Latin family and the probe decodes byte-identically under
@@ -223,6 +239,34 @@ every probe length we've measured.
threshold are not emitted into the pool. Prevents JunkFilter from
scoring weak coincidence picks against NB's confident top.
+==== CJK decode-failure veto (`CjkDecodeValidator`)
+
+A legacy multi-byte CJK class (GB18030, Big5-HKSCS, Shift_JIS, EUC-JP,
+x-windows-949, x-EUC-TW) that NB picks on Latin/Cyrillic/garbage bytes is
+*false-CJK*: those bytes don't validate under the charset, so decoding
+produces many malformed/unmappable events, whereas real CJK decodes
+cleanly. After NB, each legacy-CJK candidate is decoded under its vendor
+superset (`CharsetSupersets`) and its failure rate measured as
+`failures / high-bytes`; above ~2.5% the candidate is dropped — and if it
+was NB's only pick, the pool empties and windows-1252 wins. Two
+corrections make the rate trustworthy:
+
+* **Decode under the vendor superset, not the strict base** — real
+ vendor-extension chars (NEC/IBM for Shift_JIS/EUC-JP, HKSCS for Big5)
+ would otherwise count as failures and penalize genuine CJK.
+* **Discount embedded UTF-8** — mixed-encoding pages (legacy CJK body +
+ UTF-8 widgets) would otherwise read as 2–9.5% failure. The validator
+ walks the bytes and *skips* positions that begin a valid UTF-8 sequence
+ (it does NOT physically strip them — that would misalign a pure
+ legacy-CJK stream and manufacture failures), decoding the legacy charset
+ in place elsewhere. Post-discount, real CJK (pure or mixed) is ≤1.6%
+ while genuine false-CJK stays ≥5.3%, so ~2.5% separates them.
+
+This veto catches *structurally-illegal* false-CJK only. The
+*legal-but-wrong* class — Latin/Cyrillic bytes that form a *valid* CJK
+decode at ~0 failure — is the typicality layer's job (<<junk-filter>>),
+not this veto's.
+
[[junk-filter]]
== JunkFilterEncodingDetector — text-quality arbitration
@@ -261,6 +305,36 @@ For plain first-match-wins, omit JunkFilter (see <<opting-out-of-arbitration>>).
. **Pairwise tournament** — first candidate seeds champion; each
challenger compared via `JunkDetector.compare`; higher z-score wins.
+=== Post-tournament demote gates
+
+Two demote-only refinements run after the champion is chosen. Each fires only
+to *demote* the champion across one boundary the whole-text z-score reads
+poorly under COMMON-dilution; neither can promote, so they cannot cost a
+confident detection.
+
+* **CJK family gate** — the whole-text z coin-flips on the CJK/non-CJK boundary
+ when markup and digits decode identically and swamp the few discriminating
+ high bytes. A script-letter "diff" z — scored over only the `>= 0x80`
+ letters/ideographs, where candidates actually differ — reads that boundary
+ cleanly. If the champion is CJK and the best non-CJK diff-z beats the best
+ CJK diff-z by `FAMILY_DIFF_MARGIN` (2.0), demote to the best non-CJK
+ candidate. The reverse (promote to CJK) regressed at scale and is
+ unnecessary — genuine CJK is `<meta>`-declared upstream.
+
+* **Within-Latin letter gate** — among single-byte Latin siblings the z also
+ coin-flips, occasionally promoting a DOS-OEM / Mac charset (IBM850,
+ x-MacRoman) whose high bytes decode to box-drawing / symbols over the
+ windows-1252 truth. Cased-letter count reads this where typicality cannot:
+ if the champion is a Latin SBCS, a windows-1252 candidate is present, the
+ probe is high-byte-dense, and windows-1252 decodes clearly more cased
+ high-byte letters (by a margin), demote to windows-1252. Directional — a
+ genuine Central-European / DOS document has *more* letters under its true
+ charset, so the gate stays silent. Latin-scoped, so it never crosses the
+ CJK boundary (the family gate's job) or touches a non-Latin SBCS, whose
+ Cyrillic/Greek cased letters would pollute the count. Shares the
+ `HighByteLetterStats` letter counter with Mojibuster's Western-Latin sibling
+ fallback.
+
=== JunkDetector scoring
`JunkDetector` partitions decoded text into maximal Unicode-script runs
@@ -485,7 +559,7 @@ can't encode typographic characters).
value, vocabulary size, and each trained bigram as
`(uint16 bigram, int8 logP)` pairs.
-Files for the shipped 33-class model are ~1 MB on disk. Loader
+Files for the shipped 34-class model are ~1 MB on disk. Loader
materializes a dense `logP8[65 536 × numClasses]` array filled with
per-class unseen floors, overwritten by trained pairs. Working-set
memory: ~2 MB.
@@ -498,7 +572,7 @@ with feature hashing. The move to NB was driven by:
* **Speed**: direct bigram indexing removes the hash + bucket-lookup
cost. Inner loop is `score[c] += logP[b × numClasses + c] × idf[b]`
with no branching (zero-IDF bigrams are skipped before the class
- loop). Measured ~15 µs on a full 1 KB probe for 33 classes.
+ loop). Measured ~15 µs on a full 1 KB probe for 34 classes.
* **Memory layout**: bigram-major byte arrays fit in L3 cache for the
full table. Sequential access through the hot loop is cache-line
efficient.
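
The CJK decode-failure veto described in the design page above measures `failures / high-bytes` for a candidate charset. The following is a minimal sketch of that ratio only, using the JDK's `CharsetDecoder` in `REPORT` mode. It is not the `CjkDecodeValidator` implementation: it applies neither the vendor-superset mapping nor the embedded-UTF-8 discount, and the class name, method names and the `VETO_THRESHOLD` constant (taken from the "~2.5%" figure in the text) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class DecodeFailureRateSketch {

    /** Hypothetical constant mirroring the "~2.5%" cut-off quoted in the design text. */
    static final double VETO_THRESHOLD = 0.025;

    /** Malformed/unmappable decode events divided by the number of bytes >= 0x80. */
    static double failureRate(byte[] probe, Charset cs) {
        int highBytes = 0;
        for (byte b : probe) {
            if ((b & 0xFF) >= 0x80) {
                highBytes++;
            }
        }
        if (highBytes == 0) {
            return 0.0; // pure ASCII: nothing to judge
        }
        CharsetDecoder dec = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(probe);
        CharBuffer out = CharBuffer.allocate(probe.length * 2 + 16);
        int failures = 0;
        while (true) {
            CoderResult r = dec.decode(in, out, true);
            if (r.isError()) {
                failures++;
                in.position(in.position() + r.length()); // skip the offending bytes
            } else if (r.isOverflow()) {
                out.clear(); // decoded text is not needed, only the error count
            } else {
                break; // underflow with endOfInput=true: all input consumed
            }
        }
        return (double) failures / highBytes;
    }

    static boolean shouldVeto(byte[] probe, Charset cs) {
        return failureRate(probe, cs) > VETO_THRESHOLD;
    }

    public static void main(String[] args) {
        Charset gb18030 = Charset.forName("GB18030");
        byte[] realCjk = "\u4E2D\u6587\u7F51\u9875".getBytes(gb18030);
        byte[] latin1Bytes = {'c', 'a', 'f', (byte) 0xE9, ' '};
        System.out.println("GB18030 text under GB18030: " + failureRate(realCjk, gb18030));
        System.out.println("Latin-1 bytes under UTF-8:   "
                + failureRate(latin1Bytes, StandardCharsets.UTF_8));
    }
}
```

Dividing by high bytes rather than total length keeps the rate meaningful on markup-heavy pages, where ASCII would otherwise dilute a handful of failures toward zero.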
diff --git a/docs/modules/ROOT/pages/advanced/integration-testing/tika-eval-regression.adoc b/docs/modules/ROOT/pages/advanced/integration-testing/tika-eval-regression.adoc
index a81f6fabd4..61a8981633 100644
--- a/docs/modules/ROOT/pages/advanced/integration-testing/tika-eval-regression.adoc
+++ b/docs/modules/ROOT/pages/advanced/integration-testing/tika-eval-regression.adoc
@@ -339,7 +339,8 @@ waiting on stdin).
Key tables: `profiles_a`/`profiles_b` (one row per extracted file: `file_name`,
`mime_id`, `length`, …), `contents_a`/`contents_b` (text profile: `oov`,
-`languageness`, `num_tokens`, `lang_id_1`, …), `content_comparisons`
+`languageness`, `num_tokens`, `lang_id_1`, `num_replacement` (U+FFFD count),
+`num_non_ascii`, …), `content_comparisons`
(`dice_coefficient`, `overlap`), `mimes`, `containers`. *A and B are paired by
`id`* — the same row `id` is the same file in both runs (this is how the built-in
reports join: `join profiles_b pb on pa.id = pb.id`). Always join on `id`.
@@ -351,6 +352,24 @@ SELECT SUM(CASE WHEN cb.oov < ca.oov THEN 1 ELSE 0 END) AS oov_better,
SUM(CASE WHEN cb.oov > ca.oov THEN 1 ELSE 0 END) AS oov_worse
FROM contents_a ca JOIN contents_b cb ON ca.id = cb.id;
+-- NOTE: OOV is one signal, not the verdict -- read it with languageness and the
+-- FFFD rate (use OOV as a secondary signal). OOV can mislead (a langid shift,
+-- e.g. a CJK decode recovered in B, inflates oov_worse even when B is correct; a
+-- wrong decode that fragments words can LOWER OOV), and languageness can mislead
+-- on SBCS-cross-script mojibake -- each is right where the other is blind. When
+-- OOV-worse and languageness disagree, that file needs a look (split below):
+SELECT SUM(CASE WHEN cb.languageness > ca.languageness + 0.2 THEN 1 ELSE 0 END) AS lang_better_oov_lied,
+ SUM(CASE WHEN cb.languageness < ca.languageness - 0.2 THEN 1 ELSE 0 END) AS lang_worse_real_candidate
+FROM contents_a ca JOIN contents_b cb ON ca.id = cb.id
+WHERE cb.oov > ca.oov + 0.02 AND ca.languageness > -90 AND cb.languageness > -90;
+
+-- FFFD decode-failure rate, un-diluted (over non-ASCII chars, NOT total length,
+-- which dilutes to ~0 on ASCII-dominated docs)
+SELECT ROUND(100.0 * cb.num_replacement / NULLIF(cb.num_non_ascii, 0), 1) AS fffd_pct,
+ cb.num_replacement, cb.num_non_ascii
+FROM contents_b cb WHERE cb.num_replacement > 0
+ORDER BY cb.num_replacement DESC FETCH FIRST 20 ROWS ONLY;
+
-- net common-tokens A vs B (headline "more real text recovered" metric)
SELECT SUM(ca.num_common_tokens) AS common_a,
SUM(cb.num_common_tokens) AS common_b,
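
The un-diluted FFFD rate used in the query above (`num_replacement / num_non_ascii`) can be illustrated on a single extracted string. This is a sketch under assumptions: it counts U+FFFD itself among the non-ASCII code points, which may or may not match how the `NonAsciiCharCounter` in this commit counts, and the class and method names are hypothetical.

```java
final class FffdRateSketch {

    /** U+FFFD count over non-ASCII code points, as a percentage; -1 if there are none. */
    static double fffdPercent(String text) {
        long replacement = text.codePoints().filter(cp -> cp == 0xFFFD).count();
        long nonAscii = text.codePoints().filter(cp -> cp >= 0x80).count();
        return nonAscii == 0 ? -1.0 : 100.0 * replacement / nonAscii;
    }

    /** The diluted variant: U+FFFD count over total code points. */
    static double fffdPercentOfLength(String text) {
        long total = text.codePoints().count();
        long replacement = text.codePoints().filter(cp -> cp == 0xFFFD).count();
        return total == 0 ? -1.0 : 100.0 * replacement / total;
    }

    public static void main(String[] args) {
        // Mostly ASCII markup with two non-ASCII code points, one of them a failed decode.
        String text = "<html><body><p>caf\u00E9 men\uFFFD</p></body></html>";
        System.out.println("over non-ASCII: " + fffdPercent(text));
        System.out.println("over length:    " + fffdPercentOfLength(text));
    }
}
```

On ASCII-heavy input the length-based figure shrinks as the markup grows, while the non-ASCII-based figure stays tied to the bytes that could actually fail to decode.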
diff --git a/tika-core/src/main/java/org/apache/tika/detect/CharsetSupersets.java b/tika-core/src/main/java/org/apache/tika/detect/CharsetSupersets.java
index f53c98f847..88bd5416bb 100644
--- a/tika-core/src/main/java/org/apache/tika/detect/CharsetSupersets.java
+++ b/tika-core/src/main/java/org/apache/tika/detect/CharsetSupersets.java
@@ -42,6 +42,7 @@ import java.util.Map;
* <li>GB2312 → GB18030 (GB18030 is a strict superset of both GB2312 and GBK)</li>
* <li>GBK → GB18030 (GB18030 is a strict superset; enables 4-byte extension sequences)</li>
* <li>Shift_JIS → windows-31j (MS932 is a strict superset with NEC/IBM extensions)</li>
+ * <li>EUC-JP → x-eucJP-Open (EUC packing of the NEC/IBM vendor extensions)</li>
* </ul>
*/
public final class CharsetSupersets {
@@ -59,6 +60,7 @@ public final class CharsetSupersets {
m.put("GB2312", "GB18030");
m.put("GBK", "GB18030");
m.put("Shift_JIS", "windows-31j");
+ m.put("EUC-JP", "x-eucJP-Open");
SUPERSET_MAP = Collections.unmodifiableMap(m);
}
diff --git a/tika-core/src/main/java/org/apache/tika/detect/HighByteLetterStats.java b/tika-core/src/main/java/org/apache/tika/detect/HighByteLetterStats.java
new file mode 100644
index 0000000000..a06f426d4b
--- /dev/null
+++ b/tika-core/src/main/java/org/apache/tika/detect/HighByteLetterStats.java
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.detect;
+
+import java.nio.charset.Charset;
+
+/**
+ * High-byte decode-quality statistics shared by the charset detectors.
+ *
+ * <p>Used to disambiguate single-byte <em>Latin</em> charset siblings
+ * (windows-1252 vs IBM850 / x-MacRoman / ISO-8859-x), where a wrong decode maps
+ * high bytes to box-drawing / symbols while the right one maps them to accented
+ * letters. The cased-letter count reads that boundary; the byte-bigram
+ * typicality models cannot (both decodes look like typical Latin, and on
+ * COMMON-dominated docs the discriminating bytes are diluted to noise).</p>
+ *
+ * <p><b>Latin-only.</b> {@link #countCasedHighByteLetters} counts Lu/Ll/Lt,
+ * which also covers Cyrillic/Greek cased letters and would be polluted by a
+ * non-Latin SBCS; and it excludes Lo, so a CJK decode (every ideograph is Lo)
+ * cannot win on "letters". Callers must restrict the comparison to Latin SBCS
+ * candidates.</p>
+ */
+public final class HighByteLetterStats {
+
+ private HighByteLetterStats() {
+ }
+
+ /** Count of bytes ≥ 0x80 in the probe. */
+ public static int countHighBytes(byte[] probe) {
+ if (probe == null) {
+ return 0;
+ }
+ int n = 0;
+ for (byte b : probe) {
+ if ((b & 0xFF) >= 0x80) {
+ n++;
+ }
+ }
+ return n;
+ }
+
+ /**
+ * Decode {@code probe} under {@code cs} and count codepoints ≥ 0x80 that
+ * are Unicode cased letters (Lu/Ll/Lt). Excludes the ordinal / superscript
+ * indicators ª (U+00AA), º (U+00BA), ⁿ (U+207F): MacRoman's 0xBB/0xBC are
+ * ª/º while windows-1252's 0xBB is » (punctuation), so without the exclusion
+ * MacRoman's letter count would beat windows-1252's wherever » appears.
+ * Lo (CJK / other-letter) is excluded by counting cased categories only.
+ */
+ public static int countCasedHighByteLetters(byte[] probe, Charset cs) {
+ if (probe == null) {
+ return 0;
+ }
+ String decoded;
+ try {
+ decoded = new String(probe, cs);
+ } catch (Exception e) {
+ return 0;
+ }
+ int count = 0;
+ for (int i = 0; i < decoded.length(); ) {
+ int cp = decoded.codePointAt(i);
+ if (cp >= 0x80 && isCasedLatinishLetter(cp)) {
+ count++;
+ }
+ i += Character.charCount(cp);
+ }
+ return count;
+ }
+
+ private static boolean isCasedLatinishLetter(int cp) {
+ if (cp == 0x00AA || cp == 0x00BA || cp == 0x207F) {
+ return false; // ª, º, ⁿ — ordinal / superscript indicators
+ }
+ int type = Character.getType(cp);
+ return type == Character.UPPERCASE_LETTER
+ || type == Character.LOWERCASE_LETTER
+ || type == Character.TITLECASE_LETTER;
+ }
+}
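
The within-Latin letter gate that `HighByteLetterStats` supports can be sketched end to end on the same byte range the unit test in this commit uses. The block below re-implements the cased-letter count so that it runs without Tika on the classpath; the `shouldDemote` helper and its `margin` parameter are hypothetical stand-ins for the gate's real conditions, which also include a high-byte density check not shown here. It assumes the JDK ships the `IBM850` charset, which is part of the extended charsets on most full JDK installs.

```java
import java.nio.charset.Charset;

final class LatinLetterGateSketch {

    /** Count decoded code points >= 0x80 that are Lu/Ll/Lt, skipping the ordinal indicators. */
    static int casedHighLetters(byte[] probe, Charset cs) {
        String decoded = new String(probe, cs);
        int count = 0;
        for (int i = 0; i < decoded.length(); ) {
            int cp = decoded.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80 || cp == 0x00AA || cp == 0x00BA || cp == 0x207F) {
                continue;
            }
            int type = Character.getType(cp);
            if (type == Character.UPPERCASE_LETTER
                    || type == Character.LOWERCASE_LETTER
                    || type == Character.TITLECASE_LETTER) {
                count++;
            }
        }
        return count;
    }

    /** Hypothetical gate: demote only when windows-1252 yields clearly more cased letters. */
    static boolean shouldDemote(byte[] probe, Charset champion, Charset win1252, int margin) {
        return casedHighLetters(probe, win1252) > casedHighLetters(probe, champion) + margin;
    }

    public static void main(String[] args) {
        byte[] probe = new byte[16];
        for (int i = 0; i < probe.length; i++) {
            probe[i] = (byte) (0xC0 + i); // letters in windows-1252, mostly box-drawing in IBM850
        }
        Charset win = Charset.forName("windows-1252");
        Charset ibm = Charset.forName("IBM850");
        System.out.println("windows-1252 letters: " + casedHighLetters(probe, win));
        System.out.println("IBM850 letters:       " + casedHighLetters(probe, ibm));
        System.out.println("demote IBM850?        " + shouldDemote(probe, ibm, win, 6));
    }
}
```

Because the comparison is one-directional, a document that really is IBM850 or Central European produces more letters under its own charset and the helper returns false, which matches the "demote-only" behaviour the design text describes.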
diff --git a/tika-core/src/test/java/org/apache/tika/detect/CharsetSupersetsTest.java b/tika-core/src/test/java/org/apache/tika/detect/CharsetSupersetsTest.java
new file mode 100644
index 0000000000..738fad6196
--- /dev/null
+++ b/tika-core/src/test/java/org/apache/tika/detect/CharsetSupersetsTest.java
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.detect;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNull;
+
+import java.nio.charset.Charset;
+import java.nio.charset.StandardCharsets;
+
+import org.junit.jupiter.api.Test;
+
+public class CharsetSupersetsTest {
+
+ private static String name(String detected) {
+ Charset s = CharsetSupersets.supersetOf(Charset.forName(detected));
+ return s == null ? null : s.name();
+ }
+
+ @Test
+ public void mapsLegacyCjkToVendorSupersets() {
+ assertEquals("x-windows-949", name("EUC-KR"));
+ assertEquals("Big5-HKSCS", name("Big5"));
+ assertEquals("GB18030", name("GB2312"));
+ assertEquals("GB18030", name("GBK"));
+ assertEquals("windows-31j", name("Shift_JIS"));
+ assertEquals("x-eucJP-Open", name("EUC-JP"));
+ }
+
+ @Test
+ public void returnsNullWhenNoSuperset() {
+ assertNull(CharsetSupersets.supersetOf(null));
+ assertNull(CharsetSupersets.supersetOf(StandardCharsets.UTF_8));
+ assertNull(CharsetSupersets.supersetOf(Charset.forName("windows-1252")));
+ // Superset targets have no further superset.
+ assertNull(CharsetSupersets.supersetOf(Charset.forName("GB18030")));
+ }
+
+ /** The point of the map: vendor-extension bytes the strict base drops to
+ * U+FFFD decode correctly under the superset. */
+ @Test
+ public void supersetRecoversVendorExtensionChars() {
+ // CP932/EUC-JP NEC special U+2460 (circled one): strict base fails to
+ // U+FFFD, superset maps it.
+ byte[] sjis = {(byte) 0x87, (byte) 0x40};
+ assertEquals('\uFFFD', new String(sjis, Charset.forName("Shift_JIS")).charAt(0));
+ assertEquals("\u2460", new String(sjis, Charset.forName(name("Shift_JIS"))));
+
+ byte[] eucjp = {(byte) 0xAD, (byte) 0xA1};
+ assertEquals('\uFFFD', new String(eucjp, Charset.forName("EUC-JP")).charAt(0));
+ assertEquals("\u2460", new String(eucjp, Charset.forName(name("EUC-JP"))));
+ }
+}
diff --git a/tika-core/src/test/java/org/apache/tika/detect/HighByteLetterStatsTest.java b/tika-core/src/test/java/org/apache/tika/detect/HighByteLetterStatsTest.java
new file mode 100644
index 0000000000..de4379c99f
--- /dev/null
+++ b/tika-core/src/test/java/org/apache/tika/detect/HighByteLetterStatsTest.java
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.detect;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+import java.nio.charset.Charset;
+
+import org.junit.jupiter.api.Test;
+
+public class HighByteLetterStatsTest {
+
+ private static final Charset WIN1252 = Charset.forName("windows-1252");
+ private static final Charset IBM850 = Charset.forName("IBM850");
+ private static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");
+
+ /** Bytes 0xC0-0xCF are À-Ï (all letters) in windows-1252 but mostly
+ * box-drawing (└┴┬├─┼ ... ¤) in IBM850 — the box-drawing signature the
+ * within-Latin gate keys on. */
+ @Test
+ void winBeatsIbm850OnBoxDrawingRange() {
+ byte[] probe = new byte[16];
+ for (int i = 0; i < 16; i++) {
+ probe[i] = (byte) (0xC0 + i);
+ }
+ int win = HighByteLetterStats.countCasedHighByteLetters(probe, WIN1252);
+ int ibm = HighByteLetterStats.countCasedHighByteLetters(probe, IBM850);
+ assertEquals(16, win, "all of 0xC0-0xCF are letters in windows-1252");
+ assertTrue(ibm <= 4, "IBM850 maps most of 0xC0-0xCF to box-drawing; was " + ibm);
+ assertTrue(win > ibm + 6, "decisive letter gap expected; win=" + win + " ibm=" + ibm);
+ }
+
+ /** ª (0xAA), º (0xBA) are ordinal indicators, not letters; é (0xE9) is. */
+ @Test
+ void excludesOrdinalIndicators() {
+ byte[] probe = {(byte) 0xAA, (byte) 0xBA, (byte) 0xE9};
+ assertEquals(1, HighByteLetterStats.countCasedHighByteLetters(probe, WIN1252),
+ "only é should count; ª and º are ordinal indicators");
+ }
+
+ /** CJK ideographs are Lo (other-letter), excluded — so a CJK decode can
+ * never win the cased-letter comparison against a Latin sibling. */
+ @Test
+ void doesNotCountCjkIdeographs() {
+ byte[] probe = "日本語の文章".getBytes(SHIFT_JIS);
+ assertEquals(0, HighByteLetterStats.countCasedHighByteLetters(probe, SHIFT_JIS),
+ "ideographs are Lo and must not count as cased letters");
+ }
+
+ @Test
+ void countHighBytesIsByteCountAtOrAbove0x80() {
+ byte[] probe = {0x41, (byte) 0x80, (byte) 0xFF, 0x20, (byte) 0xC3};
+ assertEquals(3, HighByteLetterStats.countHighBytes(probe));
+ assertEquals(0, HighByteLetterStats.countHighBytes(new byte[0]));
+ assertEquals(0, HighByteLetterStats.countHighBytes(null));
+ }
+}
diff --git a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CjkDecodeValidator.java b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CjkDecodeValidator.java
new file mode 100644
index 0000000000..4c7254c6ee
--- /dev/null
+++ b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CjkDecodeValidator.java
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.ml.chardetect;
+
+import java.nio.ByteBuffer;
+import java.nio.CharBuffer;
+import java.nio.charset.Charset;
+import java.nio.charset.CharsetDecoder;
+import java.nio.charset.CoderResult;
+import java.nio.charset.CodingErrorAction;
+import java.util.Locale;
+
+import org.apache.tika.detect.CharsetSupersets;
+
+/**
+ * Structural false-CJK veto: measures how badly a probe fails to decode under a
+ * legacy multi-byte CJK charset, robustly against embedded UTF-8.
+ *
+ * <p>A Latin/Cyrillic/garbage page mis-detected as a legacy CJK charset decodes
+ * with many malformed/unmappable sequences; real CJK decodes cleanly. Two
+ * corrections make the rate meaningful (see the findings doc):
+ * <ol>
+ * <li>decode under the <em>vendor superset</em> ({@link CharsetSupersets}) so
+ * real vendor-extension chars aren't counted as failures;</li>
+ * <li><strong>discount embedded UTF-8</strong> — mixed-encoding pages (legacy
+ * CJK body + UTF-8 widgets) would otherwise inflate the rate. Post-discount,
+ * real CJK (pure or mixed) is ≤1.6% while genuine false-CJK stays ≥5.3%.</li>
+ * </ol>
+ *
+ * <p>The discount is done by a <em>UTF-8-aware single pass</em>, NOT by physically
+ * stripping UTF-8 runs: a real legacy-CJK char can coincidentally match UTF-8
+ * grammar (e.g. Shift_JIS kanji with lead 0xE0–0xEA), and physically removing it
+ * would misalign the stream and manufacture failures on genuine CJK. Instead we
+ * walk the bytes, skip positions that begin a valid UTF-8 sequence, and decode the
+ * legacy charset in place everywhere else — so real CJK is never misaligned and
+ * the rate errs toward <em>not</em> vetoing.
+ *
+ * <p>Does NOT catch the legal-but-wrong class (Latin bytes that form <em>valid</em>
+ * CJK at ~0 failure) — that's the typicality layer's job.
+ */
+public final class CjkDecodeValidator {
+
+ private CjkDecodeValidator() {
+ }
+
+ /** Minimum legacy (non-UTF-8) high bytes required before the rate is trusted. */
+ public static final int MIN_HIGH_BYTES = 30;
+
+ /**
+ * Failure rate of {@code bytes} under {@code cjkCharset}'s vendor superset,
+ * counting only legacy high bytes (embedded UTF-8 is skipped, not counted).
+ *
+ * @return failures / legacy-high-bytes, or {@code -1.0} when there is too
+ * little legacy evidence (legacy high bytes < {@link #MIN_HIGH_BYTES})
+ */
+ public static double strippedFailureRate(byte[] bytes, Charset cjkCharset) {
+ Charset decodeAs = CharsetSupersets.supersetOf(cjkCharset);
+ if (decodeAs == null) {
+ decodeAs = cjkCharset;
+ }
+ CharsetDecoder dec = decodeAs.newDecoder()
+ .onMalformedInput(CodingErrorAction.REPORT)
+ .onUnmappableCharacter(CodingErrorAction.REPORT);
+ CharBuffer one = CharBuffer.allocate(1);
+ int i = 0;
+ int n = bytes.length;
+ int fail = 0;
+ int nHigh = 0;
+ while (i < n) {
+ int x = bytes[i] & 0xFF;
+ if (x < 0x80) {
+ i++;
+ continue;
+ }
+ int ulen = utf8SequenceLength(bytes, i);
+ if (ulen > 0) {
+ i += ulen; // embedded UTF-8 — not legacy content, skip
+ continue;
+ }
+ nHigh++;
+ dec.reset();
+ one.clear();
+ ByteBuffer in = ByteBuffer.wrap(bytes, i, Math.min(4, n - i));
+ CoderResult r = dec.decode(in, one, true);
+ if (r.isError()) {
+ fail++;
+ i++;
+ } else {
+ int consumed = in.position() - i;
+ i += Math.max(1, consumed);
+ }
+ }
+ if (nHigh < MIN_HIGH_BYTES) {
+ return -1.0;
+ }
+ return (double) fail / nHigh;
+ }
+
+ /** True for the legacy multi-byte CJK charsets this veto applies to (the
+ * decode-failure signal is meaningful only for these; ISO-2022 is handled
+ * structurally and single-byte charsets don't apply). */
+ public static boolean appliesTo(String charsetName) {
+ String name = charsetName.toLowerCase(Locale.ROOT);
+ if (name.contains("2022")) {
+ return false; // escape-based, structural
+ }
+ return name.contains("gb") || name.contains("big5") || name.contains("euc")
+ || name.contains("shift") || name.contains("jis") || name.contains("949");
+ }
+
+ /** Length (2/3/4) of a valid UTF-8 multi-byte sequence starting at {@code i},
+ * or 0 if none. Lead-byte ranges exclude overlong 2-byte (C0/C1) and
+ * out-of-range (≥F5) leads; continuations must be 0x80–0xBF. */
+ static int utf8SequenceLength(byte[] b, int i) {
+ int x = b[i] & 0xFF;
+ int len;
+ if (x >= 0xC2 && x <= 0xDF) {
+ len = 2;
+ } else if (x >= 0xE0 && x <= 0xEF) {
+ len = 3;
+ } else if (x >= 0xF0 && x <= 0xF4) {
+ len = 4;
+ } else {
+ return 0;
+ }
+ if (i + len > b.length) {
+ return 0;
+ }
+ for (int k = 1; k < len; k++) {
+ int c = b[i + k] & 0xFF;
+ if (c < 0x80 || c > 0xBF) {
+ return 0;
+ }
+ }
+ return len;
+ }
+}
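
Reviewer sketch, not part of the patch: the UTF-8-aware single pass described in the class Javadoc can be illustrated without a charset decoder. The class `Utf8AwareWalkSketch` and the method `legacyHighBytes` are hypothetical names. As a simplification, the sketch advances one byte per legacy high byte, whereas the patch advances by however many bytes the legacy decoder consumed.

```java
public class Utf8AwareWalkSketch {

    /** Length (2, 3, or 4) of a UTF-8 multi-byte sequence starting at i, or 0 if none. */
    static int utf8SequenceLength(byte[] b, int i) {
        int x = b[i] & 0xFF;
        int len = (x >= 0xC2 && x <= 0xDF) ? 2
                : (x >= 0xE0 && x <= 0xEF) ? 3
                : (x >= 0xF0 && x <= 0xF4) ? 4 : 0;
        if (len == 0 || i + len > b.length) {
            return 0;
        }
        for (int k = 1; k < len; k++) {
            int c = b[i + k] & 0xFF;
            if (c < 0x80 || c > 0xBF) {
                return 0; // not a continuation byte
            }
        }
        return len;
    }

    /** Counts high bytes that remain after valid UTF-8 sequences are skipped in place. */
    static int legacyHighBytes(byte[] b) {
        int count = 0;
        int i = 0;
        while (i < b.length) {
            if ((b[i] & 0xFF) < 0x80) {
                i++; // ASCII carries no evidence either way
                continue;
            }
            int ulen = utf8SequenceLength(b, i);
            if (ulen > 0) {
                i += ulen; // embedded UTF-8: skipped, never removed from the array
                continue;
            }
            count++;
            i++;
        }
        return count;
    }
}
```

Because the array is never rewritten, bytes on either side of a skipped run keep their original offsets, which is the alignment property the Javadoc relies on.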
diff --git a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CosineFamilyArbiter.java b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CosineFamilyArbiter.java
new file mode 100644
index 0000000000..057bba4a92
--- /dev/null
+++ b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CosineFamilyArbiter.java
@@ -0,0 +1,241 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.ml.chardetect;
+
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.charset.Charset;
+import java.nio.charset.IllegalCharsetNameException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+
+import org.apache.tika.detect.EncodingResult;
+
+/**
+ * Family-level guard over the NB statistical pick, defending against the
+ * single-byte→CJK collision (Cyrillic / Greek / accented-Latin content
+ * whose high bytes coincide with legal GBK lead/trail pairs and accumulate
+ * spurious GB18030 / Big5 likelihood under the multinomial NB).
+ *
+ * <p>Two complementary, model-light signals, both blind to NB:</p>
+ * <ul>
+ * <li><b>high-byte cosine</b> — cosine between the probe's high-byte
+ * (≥ 0x80) byte-bigram occupancy and each class's control-stripped
+ * high-byte profile. Direction-based, so length/density-invariant; the
+ * ASCII quadrant is dropped so shared English text can't dominate. When
+ * NB picks a CJK class but the cosine argmax is non-CJK (with enough
+ * high-byte evidence), the CJK pick is vetoed.</li>
+ * <li><b>GBK illegality</b> — fraction of high-byte lead bytes that do
+ * not begin a valid GBK 2-byte or GB18030 4-byte sequence. A genuine
+ * GB18030 document is ~0% illegal; Cyrillic/Greek forced through GBK
+ * throws illegal trails. Scoped to GB18030 only (it says nothing about
+ * Shift_JIS/EUC).</li>
+ * </ul>
+ *
+ * <p>On veto the CJK pick is replaced by the best non-CJK candidate (by cosine
+ * when evidence is sufficient, else the highest-ranked non-CJK NB candidate);
+ * real-CJK picks are left untouched (cosine argmax stays CJK, illegality ~0),
+ * so the guard is regression-safe for genuine CJK.</p>
+ */
+public final class CosineFamilyArbiter {
+
+ /** Minimum high-byte bigram count before the cosine veto is trusted. */
+ public static final int MIN_HIGH_BYTE_SUPPORT = 15;
+
+ /** GBK-illegality fraction above which a GB18030 pick is refuted. */
+ public static final double GBK_ILLEGAL_THRESHOLD = 0.02;
+
+ private static final String GB18030 = "GB18030";
+
+ private final String[] names;
+ private final boolean[] cjk;
+ private final Charset[] charsets; // resolved JVM charset, null if unsupported
+ private final int[][] bigramIds;
+ private final float[][] weights; // L2-normalized per class
+
+ public CosineFamilyArbiter(InputStream in) throws IOException {
+ try (DataInputStream dis = new DataInputStream(in)) {
+ int nc = dis.readInt();
+ names = new String[nc];
+ cjk = new boolean[nc];
+ charsets = new Charset[nc];
+ bigramIds = new int[nc][];
+ weights = new float[nc][];
+ for (int c = 0; c < nc; c++) {
+ names[c] = dis.readUTF();
+ cjk[c] = isCjkName(names[c]);
+ charsets[c] = resolve(names[c]);
+ int nnz = dis.readInt();
+ int[] ids = new int[nnz];
+ float[] w = new float[nnz];
+ for (int k = 0; k < nnz; k++) {
+ ids[k] = dis.readUnsignedShort();
+ w[k] = dis.readFloat();
+ }
+ bigramIds[c] = ids;
+ weights[c] = w;
+ }
+ }
+ }
+
+ private static Charset resolve(String name) {
+ try {
+ return Charset.isSupported(name) ? Charset.forName(name) : null;
+ } catch (IllegalCharsetNameException e) {
+ return null;
+ }
+ }
+
+ static boolean isCjkName(String name) {
+ String n = name.toLowerCase(Locale.ROOT);
+ return n.contains("gb") || n.contains("big5") || n.contains("euc")
+ || n.contains("shift") || n.contains("jis") || n.contains("2022")
+ || n.contains("949");
+ }
+
+ /**
+ * Apply the family guard to NB's ranked candidates. Returns {@code
+ * nbResults} unchanged unless NB's top pick is CJK and a veto fires, in
+ * which case a non-CJK replacement is promoted to the front.
+ */
+ public List<EncodingResult> arbitrate(byte[] probe, List<EncodingResult> nbResults) {
+ if (nbResults == null || nbResults.isEmpty()) {
+ return nbResults;
+ }
+ if (!isCjkName(nbResults.get(0).getCharset().name())) {
+ return nbResults;
+ }
+ // Build high-byte bigram occupancy.
+ Map<Integer, Integer> docMap = new HashMap<>();
+ long support = 0;
+ for (int i = 0; i + 1 < probe.length; i++) {
+ int b0 = probe[i] & 0xFF;
+ int b1 = probe[i + 1] & 0xFF;
+ if (b0 >= 0x80 || b1 >= 0x80) {
+ int bg = (b0 << 8) | b1;
+ docMap.merge(bg, 1, Integer::sum);
+ support++;
+ }
+ }
+ double docNorm = 0;
+ for (int v : docMap.values()) {
+ docNorm += (double) v * v;
+ }
+ docNorm = Math.sqrt(docNorm);
+
+ boolean gbkTop = GB18030.equals(nbResults.get(0).getCharset().name());
+ double illegal = gbkIllegalRate(probe);
+
+ int cosArg = -1;
+ double bestCos = -1;
+ double[] cos = new double[names.length];
+ if (docNorm > 0) {
+ for (int c = 0; c < names.length; c++) {
+ double dot = 0;
+ int[] ids = bigramIds[c];
+ float[] w = weights[c];
+ for (int k = 0; k < ids.length; k++) {
+ Integer dc = docMap.get(ids[k]);
+ if (dc != null) {
+ dot += w[k] * dc;
+ }
+ }
+ cos[c] = dot / docNorm;
+ if (cos[c] > bestCos) {
+ bestCos = cos[c];
+ cosArg = c;
+ }
+ }
+ }
+
+ boolean veto = (gbkTop && illegal > GBK_ILLEGAL_THRESHOLD)
+ || (support >= MIN_HIGH_BYTE_SUPPORT && cosArg >= 0 && !cjk[cosArg]);
+ if (!veto) {
+ return nbResults;
+ }
+
+ // Choose replacement: best non-CJK by cosine when evidence is
+ // sufficient, else the highest-ranked non-CJK NB candidate.
+ Charset replacement = null;
+ if (support >= MIN_HIGH_BYTE_SUPPORT && docNorm > 0) {
+ double bv = -1;
+ for (int c = 0; c < names.length; c++) {
+ if (!cjk[c] && charsets[c] != null && cos[c] > bv) {
+ bv = cos[c];
+ replacement = charsets[c];
+ }
+ }
+ }
+ float conf = nbResults.get(0).getConfidence();
+ List<EncodingResult> out = new ArrayList<>(nbResults.size() + 1);
+ if (replacement != null) {
+ out.add(new EncodingResult(replacement, conf, replacement.name(),
+ EncodingResult.ResultType.STATISTICAL));
+ }
+ for (EncodingResult r : nbResults) {
+ if (isCjkName(r.getCharset().name())) {
+ continue;
+ }
+ if (replacement != null && r.getCharset().name().equals(replacement.name())) {
+ continue;
+ }
+ out.add(r);
+ }
+ // If we couldn't form any non-CJK candidate, don't strand the caller
+ // with an empty list — leave NB's result untouched.
+ return out.isEmpty() ? nbResults : out;
+ }
+
+ /**
+ * Fraction of high-byte lead bytes that fail to begin a valid GBK 2-byte
+ * or GB18030 4-byte sequence. 0 for genuine GB18030.
+ */
+ static double gbkIllegalRate(byte[] b) {
+ int n = b.length;
+ int i = 0;
+ int illegal = 0;
+ int lead = 0;
+ while (i < n) {
+ int c = b[i] & 0xFF;
+ if (c < 0x80) {
+ i++;
+ continue;
+ }
+ lead++;
+ if (c >= 0x81 && c <= 0xFE && i + 1 < n) {
+ int t = b[i + 1] & 0xFF;
+ if (((t >= 0x40 && t <= 0x7E) || (t >= 0x80 && t <= 0xFE)) && t != 0x7F) {
+ i += 2;
+ continue;
+ }
+ if (t >= 0x30 && t <= 0x39 && i + 3 < n
+ && (b[i + 2] & 0xFF) >= 0x81 && (b[i + 2] & 0xFF) <= 0xFE
+ && (b[i + 3] & 0xFF) >= 0x30 && (b[i + 3] & 0xFF) <= 0x39) {
+ i += 4;
+ continue;
+ }
+ }
+ illegal++;
+ i++;
+ }
+ return lead == 0 ? 0 : (double) illegal / lead;
+ }
+}
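
Reviewer sketch, not part of the patch: the high-byte cosine signal reduces to a sparse dot product between the probe's bigram counts and a unit-length class profile. The class name `HighByteCosineSketch` and the map-based profile are hypothetical. The patch instead reads parallel `int[]` / `float[]` arrays from cosine-profiles.bin, whose contents are not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;

public class HighByteCosineSketch {

    /** Counts byte bigrams in which at least one of the two bytes is >= 0x80. */
    static Map<Integer, Integer> highByteBigrams(byte[] probe) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < probe.length; i++) {
            int b0 = probe[i] & 0xFF;
            int b1 = probe[i + 1] & 0xFF;
            if (b0 >= 0x80 || b1 >= 0x80) {
                counts.merge((b0 << 8) | b1, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine between raw probe counts and a profile assumed to be L2-normalized already. */
    static double cosine(Map<Integer, Integer> doc, Map<Integer, Double> unitProfile) {
        double normSq = 0;
        double dot = 0;
        for (Map.Entry<Integer, Integer> e : doc.entrySet()) {
            int v = e.getValue();
            normSq += (double) v * v;
            dot += unitProfile.getOrDefault(e.getKey(), 0.0) * v;
        }
        // Dividing by the probe norm makes the score independent of probe length.
        return normSq == 0 ? 0 : dot / Math.sqrt(normSq);
    }
}
```

A pure-ASCII probe yields an empty map and a score of 0, which matches the patch's `docNorm > 0` guard.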
diff --git a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/MojibusterEncodingDetector.java b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/MojibusterEncodingDetector.java
index 00254dcd96..45a919274a 100644
--- a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/MojibusterEncodingDetector.java
+++ b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/MojibusterEncodingDetector.java
@@ -30,6 +30,7 @@ import org.slf4j.LoggerFactory;
import org.apache.tika.config.TikaComponent;
import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.detect.EncodingResult;
+import org.apache.tika.detect.HighByteLetterStats;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
@@ -124,6 +125,19 @@ public class MojibusterEncodingDetector implements EncodingDetector {
*/
private static final float UTF8_STRUCTURAL_CONF = 0.95f;
+ /** Confidence for an ISO-2022-JP/KR/CN structural candidate (7-bit, escape-based). */
+ private static final float ISO2022_STRUCTURAL_CONF = 0.95f;
+
+ /** ISO-2022 decode-verify: a stray {@code ESC $} in plain ASCII must not win, so
+ * require the decode to yield real CJK at near-zero replacement rate. */
+ private static final int ISO2022_MIN_CJK = 4;
+ private static final double ISO2022_MAX_FFFD_RATE = 0.05;
+
+ /** False-CJK veto: drop an NB legacy-CJK candidate whose UTF-8-stripped decode
+ * fails at or above this rate. Post-strip, real CJK (pure or mixed) is ≤1.6% and
+ * genuine false-CJK ≥5.3%, so ~2.5% separates them (see CjkDecodeValidator). */
+ private static final double CJK_FAILURE_VETO_THRESHOLD = 0.025;
+
/** Confidence for the windows-1252 fallback emitted on empty/ASCII probes. */
private static final float FALLBACK_CONFIDENCE = 0.1f;
@@ -212,7 +226,7 @@ public class MojibusterEncodingDetector implements EncodingDetector {
public List<EncodingResult> detect(byte[] probe, Metadata metadata) {
if (LOG.isTraceEnabled()) {
int probeLen = probe == null ? 0 : probe.length;
- int highBytes = probe == null ? 0 : countHighBytes(probe);
+ int highBytes = probe == null ? 0 : HighByteLetterStats.countHighBytes(probe);
LOG.trace("mojibuster enter probe={}B highBytes={}", probeLen, highBytes);
}
// Empty / near-empty probes: return the WHATWG default so
@@ -233,6 +247,18 @@ public class MojibusterEncodingDetector implements EncodingDetector {
// consulting NB so we don't hand back a bias-driven x-MacRoman
// or IBM850 pick.
if (isPureAscii(probe)) {
+ // ISO-2022-JP/KR/CN are 7-bit escape-based encodings: NB sees no high
+ // bytes, so without this they fall to the windows-1252 default and
+ // decode to gibberish (a 4.x-vs-3.x regression; icu4j catches them).
+ // Gated to the pure-ASCII branch on purpose — high-byte binary that
+ // happens to contain an ESC sequence never reaches here, it takes the
+ // normal NB path. decode-verify guards the rare 7-bit stray-ESC case.
+ Charset iso2022 = detectIso2022Verified(probe);
+ if (iso2022 != null) {
+ LOG.trace("mojibuster -> {} (iso-2022 structural)", iso2022.name());
+ return List.of(new EncodingResult(iso2022, ISO2022_STRUCTURAL_CONF,
+ iso2022.name(), EncodingResult.ResultType.STRUCTURAL));
+ }
LOG.trace("mojibuster -> windows-1252 fallback (pure ASCII)");
return windows1252Fallback();
}
@@ -327,7 +353,13 @@ public class MojibusterEncodingDetector implements EncodingDetector {
}
}
LOG.trace("mojibuster utf8Check={} tolerated={}", utf8, utf8Tolerated);
- if (utf8 == StructuralEncodingRules.Utf8Result.LIKELY_UTF8) {
+ // Emit a structural UTF-8 candidate when the grammar is clean (LIKELY)
+ // OR essentially-UTF-8 (NOT_UTF8 with malformed bytes within tolerance —
+ // a few corrupt bytes in otherwise-valid UTF-8). Both exclude legacy
+ // CJK, which produces many grammar errors (measured: 0/321K labeled CJK
+ // samples return LIKELY or fall within tolerance). The type-priority
+ // sort in sortAndDedup then ranks this above NB's statistical pick.
+ if (utf8 == StructuralEncodingRules.Utf8Result.LIKELY_UTF8 || utf8Tolerated) {
pool.add(new EncodingResult(
java.nio.charset.StandardCharsets.UTF_8,
UTF8_STRUCTURAL_CONF, "UTF-8",
@@ -361,6 +393,18 @@ public class MojibusterEncodingDetector implements EncodingDetector {
&& !utf8Tolerated) {
continue;
}
+ // False-CJK veto: a legacy multi-byte CJK pick whose bytes don't
+ // validate (high decode-failure on the UTF-8-stripped remainder) is
+ // Latin/Cyrillic/garbage mis-read as CJK. Drop it — if it was NB's
+ // only candidate the pool empties and the windows-1252 fallback wins.
+ if (CjkDecodeValidator.appliesTo(name)) {
+ double failRate = CjkDecodeValidator.strippedFailureRate(nbInput, r.getCharset());
+ if (failRate >= CJK_FAILURE_VETO_THRESHOLD) {
+ LOG.trace("mojibuster veto {} (cjk decode-failure {}%)", name,
+ String.format(Locale.ROOT, "%.2f", failRate * 100));
+ continue;
+ }
+ }
pool.add(r);
}
@@ -404,6 +448,50 @@ public class MojibusterEncodingDetector implements EncodingDetector {
EncodingResult.ResultType.STATISTICAL));
}
+ /**
+ * Detect ISO-2022-JP/KR/CN by escape sequence, then verify the decode is
+ * real CJK (not a stray {@code ESC $} in ASCII text). Returns the charset
+ * or {@code null}. Caller guarantees {@code probe} is pure 7-bit ASCII.
+ */
+ private static Charset detectIso2022Verified(byte[] probe) {
+ Charset cs = StructuralEncodingRules.detectIso2022(probe);
+ if (cs == null) {
+ return null;
+ }
+ String decoded;
+ try {
+ decoded = new String(probe, cs); // REPLACE on malformed/unmappable
+ } catch (Exception e) {
+ return null;
+ }
+ int cjk = 0;
+ int fffd = 0;
+ for (int i = 0; i < decoded.length(); ) {
+ int cp = decoded.codePointAt(i);
+ i += Character.charCount(cp);
+ if (cp == 0xFFFD) {
+ fffd++;
+ } else if (isCjkChar(cp)) {
+ cjk++;
+ }
+ }
+ if (cjk >= ISO2022_MIN_CJK
+ && fffd <= decoded.length() * ISO2022_MAX_FFFD_RATE) {
+ return cs;
+ }
+ return null;
+ }
+
+ /** Han / kana / hangul / CJK punctuation — the scripts ISO-2022-JP/KR/CN carry. */
+ private static boolean isCjkChar(int cp) {
+ return (cp >= 0x3040 && cp <= 0x30FF) // hiragana + katakana
+ || (cp >= 0x4E00 && cp <= 0x9FFF) // CJK unified
+ || (cp >= 0x3400 && cp <= 0x4DBF) // CJK ext A
+ || (cp >= 0xAC00 && cp <= 0xD7A3) // hangul syllables
+ || (cp >= 0xFF66 && cp <= 0xFF9F) // halfwidth katakana
+ || (cp >= 0x3000 && cp <= 0x303F); // CJK symbols/punctuation
+ }
+
/**
* Pure 7-bit ASCII test: no bytes ≥ 0x80 and no null bytes.
* Null-byte exclusion prevents misclassifying UTF-16/32 content
@@ -542,8 +630,8 @@ public class MojibusterEncodingDetector implements EncodingDetector {
return ranked;
}
Charset win1252 = Charset.forName(WIN1252);
- int winLetters = countHighByteLetters(probe, win1252);
- int topLetters = countHighByteLetters(probe, top.getCharset());
+ int winLetters = HighByteLetterStats.countCasedHighByteLetters(probe, win1252);
+ int topLetters = HighByteLetterStats.countCasedHighByteLetters(probe, top.getCharset());
// Tie goes to windows-1252 (WHATWG-canonical default).
if (winLetters < topLetters) {
return ranked;
@@ -558,85 +646,23 @@ public class MojibusterEncodingDetector implements EncodingDetector {
}
/**
- * Decode the probe under {@code cs} and count codepoints that
- * are Unicode "cased letters" (Lu / Ll / Lt) at codepoints ≥
- * 0x80. Used by the Latin sibling fallback to compare decoded-
- * text quality between two candidate SBCS encodings.
- *
- * <p>Deliberately excludes a few "letter-ish but typographic"
- * categories that {@link Character#isLetter(int)} would otherwise
- * count, because they fooled the rule in earlier evals:</p>
- * <ul>
- * <li><b>Modifier letters (Lm)</b>: spacing-modifier letterlike
- * symbols (ʰ ʷ ˆ ˜ ʻ etc.) that some encodings put at
- * byte positions where the truthful encoding has a symbol /
- * punctuation.</li>
- * <li><b>Ordinal indicators</b>: U+00AA (ª), U+00BA (º),
- * U+207F (ⁿ), U+2122 (™ — not Ll, included for safety).
- * MacRoman's 0xBB and 0xBC are ª / º respectively; the
- * windows-1252 truth for byte 0xBB is » (final punctuation,
- * not a letter). Without this exclusion, MacRoman's
- * letter count beats win-1252's on probes where » appears.</li>
- * <li><b>Other letter (Lo)</b>: covers CJK / Korean letterlike
- * codepoints that occasionally fall out of byte-level
- * decodes; counting those as "Latin letters" would mislead
- * the Latin-sibling comparison.</li>
- * </ul>
- */
- private static int countHighByteLetters(byte[] probe, Charset cs) {
- String decoded;
- try {
- decoded = new String(probe, cs);
- } catch (Exception e) {
- return 0;
- }
- int count = 0;
- for (int i = 0; i < decoded.length(); ) {
- int cp = decoded.codePointAt(i);
- if (cp >= 0x80 && isCasedLatinishLetter(cp)) {
- count++;
- }
- i += Character.charCount(cp);
- }
- return count;
- }
-
- /**
- * Returns true for codepoints in Unicode's "cased letter"
- * categories (Lu / Ll / Lt) but EXCLUDING specific letterlike
- * typographic symbols (ª, º, ⁿ). See {@link #countHighByteLetters}.
- */
- private static boolean isCasedLatinishLetter(int cp) {
- if (cp == 0x00AA || cp == 0x00BA || cp == 0x207F) {
- return false; // ª, º, ⁿ — ordinal / superscript indicators
- }
- int type = Character.getType(cp);
- return type == Character.UPPERCASE_LETTER
- || type == Character.LOWERCASE_LETTER
- || type == Character.TITLECASE_LETTER;
- }
-
- private static int countHighBytes(byte[] probe) {
- int n = 0;
- for (byte b : probe) {
- if ((b & 0xFF) >= 0x80) {
- n++;
- }
- }
- return n;
- }
-
- /**
- * Sort pool by confidence descending, deduplicate by charset name
- * keeping the highest-confidence instance. Stable ordering is
- * good enough for current needs; if we need trust-type tiebreaks
- * (STRUCTURAL > DECLARATIVE > STATISTICAL) later, add here.
+ * Sort pool by trust type (STRUCTURAL > DECLARATIVE > STATISTICAL),
+ * then by confidence within a type, and deduplicate by charset name keeping
+ * the first (highest-priority) instance. Type priority is load-bearing:
+ * NB pins its statistical winner to confidence 1.0, so a structural
+ * candidate (UTF-8 grammar proof, UTF-32 codepoint validity) emitted below
+ * 1.0 would otherwise lose the sort to NB despite being the stronger signal.
*/
private static List<EncodingResult> sortAndDedup(List<EncodingResult> pool) {
if (pool.isEmpty()) {
return Collections.emptyList();
}
- pool.sort((a, b) -> Float.compare(b.getConfidence(), a.getConfidence()));
+ pool.sort((a, b) -> {
+ int byType = Integer.compare(typeRank(a.getResultType()),
+ typeRank(b.getResultType()));
+ return byType != 0 ? byType
+ : Float.compare(b.getConfidence(), a.getConfidence());
+ });
java.util.Set<String> seen = new java.util.LinkedHashSet<>();
List<EncodingResult> out = new java.util.ArrayList<>(pool.size());
for (EncodingResult r : pool) {
@@ -647,6 +673,18 @@ public class MojibusterEncodingDetector implements EncodingDetector {
return out;
}
+ /** Trust-type priority for sorting: lower wins. */
+ private static int typeRank(EncodingResult.ResultType t) {
+ switch (t) {
+ case STRUCTURAL:
+ return 0;
+ case DECLARATIVE:
+ return 1;
+ default:
+ return 2;
+ }
+ }
+
/**
* Returns stripped bytes if the probe contains well-formed HTML/XML
* tags; otherwise returns the original probe unchanged.
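
Reviewer sketch, not part of the patch: the type-priority sort in `sortAndDedup` can be shown in isolation. `Candidate`, `ResultType`, and `TypePrioritySortSketch` below are hypothetical stand-ins for Tika's `EncodingResult` and its nested enum, whose full API is not shown in this diff.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TypePrioritySortSketch {

    enum ResultType { STATISTICAL, DECLARATIVE, STRUCTURAL }

    static final class Candidate {
        final String charset;
        final float confidence;
        final ResultType type;

        Candidate(String charset, float confidence, ResultType type) {
            this.charset = charset;
            this.confidence = confidence;
            this.type = type;
        }
    }

    /** Trust-type priority: lower wins, independent of enum declaration order. */
    static int typeRank(ResultType t) {
        switch (t) {
            case STRUCTURAL:
                return 0;
            case DECLARATIVE:
                return 1;
            default:
                return 2;
        }
    }

    /** Sorts a copy by trust type first, then by confidence descending within a type. */
    static List<Candidate> sortByTrust(List<Candidate> pool) {
        List<Candidate> out = new ArrayList<>(pool);
        out.sort(Comparator.comparingInt((Candidate c) -> typeRank(c.type))
                .thenComparing((a, b) -> Float.compare(b.confidence, a.confidence)));
        return out;
    }
}
```

With a statistical GB18030 candidate at 1.0 and a structural UTF-8 candidate at 0.95, a confidence-only sort puts GB18030 first, while this ordering puts UTF-8 first.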
diff --git a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java
index 4140b6f023..5becf20ce6 100644
--- a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java
+++ b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java
@@ -143,6 +143,22 @@ public class NaiveBayesBigramEncodingDetector implements EncodingDetector {
*/
public static final int MIN_BIGRAMS_FOR_DIVERSITY_GATE = 100;
+ /**
+ * Sublinear count weighting ("count clipping"). A distinct bigram's raw
+ * repetition count {@code n} is replaced by {@code 1 + ln(n)} before it
+ * weights the per-class contribution, so a bigram repeated hundreds of
+ * times (e.g. a {@code "--"} separator run, observed 864× on one page)
+ * can no longer dominate the score by sheer volume.
+ *
+ * <p>Length-dynamic by construction (no fixed cap) and <em>class-agnostic</em>:
+ * it bounds <em>repetition</em>, an axis orthogonal to the Type C cap
+ * (which bounds a single class's per-bigram cross-class advantage) and the
+ * Type A diversity gate (which abstains only on globally-degenerate input).
+ * Partial concentration — one bigram repeated heavily inside an otherwise
+ * diverse probe — falls through all three of those guards; this closes it.</p>
+ */
+ public static final boolean SUBLINEAR_COUNT = true;
+
/**
* Script / writing-system family used by {@link #CAP_PER_BIGRAM_NATS}.
* UTF-8 stands alone so the cap engages on UTF-vs-anything pairs
@@ -513,7 +529,10 @@ public class NaiveBayesBigramEncodingDetector implements EncodingDetector {
}
int n = counts.countAt(slot);
int w = idf8[bigram];
- double countTimesIdf = (double) n * w;
+ // Sublinear count weighting: cap a repeated bigram's volume so a
+ // separator run (e.g. "--" x864) can't dominate by repetition.
+ double tf = (SUBLINEAR_COUNT && n > 1) ? (1.0 + Math.log(n)) : n;
+ double countTimesIdf = tf * w;
int base = bigram * numClasses;
if (!applyCap) {
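
Reviewer sketch, not part of the patch: the effect of the `1 + ln(n)` weighting is easiest to see with the 864-repetition separator run cited in the Javadoc. `SublinearCountSketch` and `tf` are hypothetical names; the expression itself mirrors the added line with `SUBLINEAR_COUNT` enabled.

```java
import java.util.Locale;

public class SublinearCountSketch {

    /** Sublinear term weight: counts of 0 and 1 pass through, larger counts become 1 + ln(n). */
    static double tf(int n) {
        return n > 1 ? 1.0 + Math.log(n) : n;
    }

    public static void main(String[] args) {
        // A bigram repeated 864 times weighs about as much as 7.76 single occurrences.
        System.out.println(String.format(Locale.ROOT, "%.2f", tf(864))); // prints 7.76
    }
}
```

The weight still grows with repetition, so a repeated bigram is not discarded, but its contribution grows logarithmically rather than linearly in the count.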
diff --git a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/org/apache/tika/ml/chardetect/cosine-profiles.bin b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/org/apache/tika/ml/chardetect/cosine-profiles.bin
new file mode 100644
index 0000000000..646a7e7923
Binary files /dev/null and b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/org/apache/tika/ml/chardetect/cosine-profiles.bin differ
diff --git a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/org/apache/tika/ml/chardetect/nb-bigram.bin b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/org/apache/tika/ml/chardetect/nb-bigram.bin
index bcfce41d67..b89188bb32 100644
Binary files a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/org/apache/tika/ml/chardetect/nb-bigram.bin and b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/org/apache/tika/ml/chardetect/nb-bigram.bin differ
diff --git a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/test/java/org/apache/tika/ml/chardetect/CjkDecodeValidatorTest.java b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/test/java/org/apache/tika/ml/chardetect/CjkDecodeValidatorTest.java
new file mode 100644
index 0000000000..14e074212a
--- /dev/null
+++ b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/test/java/org/apache/tika/ml/chardetect/CjkDecodeValidatorTest.java
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.ml.chardetect;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+import java.io.ByteArrayOutputStream;
+import java.nio.charset.Charset;
+import java.util.Arrays;
+
+import org.junit.jupiter.api.Test;
+
+public class CjkDecodeValidatorTest {
+
+ @Test
+ public void realJapaneseFarBelowVetoThreshold() throws Exception {
+ byte[] b = ("日本語のテスト文章をたくさん書いて高バイトを十分に確保します"
+ + "これは本物の日本語です").getBytes("Shift_JIS");
+ double rate = CjkDecodeValidator.strippedFailureRate(b, Charset.forName("Shift_JIS"));
+ assertTrue(rate >= 0.0 && rate < 0.025, "real JP should be near-zero failure, got " + rate);
+ }
+
+ @Test
+ public void realKoreanFarBelowVetoThreshold() throws Exception {
+ byte[] b = ("안녕하세요 이것은 진짜 한국어 문장입니다 고바이트를 충분히 확보하기 위해 "
+ + "여러 글자를 적습니다").getBytes("EUC-KR");
+ double rate = CjkDecodeValidator.strippedFailureRate(b, Charset.forName("EUC-KR"));
+ assertTrue(rate >= 0.0 && rate < 0.025, "real KR should be near-zero failure, got " + rate);
+ }
+
+ /** Mixed-encoding: legacy CJK body + an embedded UTF-8 run. Stripping the UTF-8
+ * run de-confounds, so the rate stays low (the WS2 breakthrough). */
+ @Test
+ public void mixedLegacyPlusUtf8NotVetoed() throws Exception {
+ ByteArrayOutputStream bo = new ByteArrayOutputStream();
+ bo.writeBytes("日本語の本文をしっかり書いて高バイトを確保する本物のテキスト".getBytes("Shift_JIS"));
+ bo.writeBytes("これはUTF-8の埋め込みウィジェット".getBytes("UTF-8")); // embedded UTF-8
+ double rate = CjkDecodeValidator.strippedFailureRate(bo.toByteArray(),
+ Charset.forName("Shift_JIS"));
+ assertTrue(rate >= 0.0 && rate < 0.025, "mixed real CJK should stay low post-strip, got " + rate);
+ }
+
+ @Test
+ public void garbageHighBytesVetoed() {
+ byte[] b = new byte[60];
+ Arrays.fill(b, (byte) 0xFF); // 0xFF is not a valid GB18030 lead → all malformed
+ double rate = CjkDecodeValidator.strippedFailureRate(b, Charset.forName("GB18030"));
+ assertTrue(rate >= 0.025, "garbage high bytes should be vetoed, got " + rate);
+ }
+
+ @Test
+ public void insufficientHighBytesReturnsMinusOne() {
+ byte[] b = "mostly ascii with a couple high bytes".getBytes(java.nio.charset.StandardCharsets.ISO_8859_1);
+ assertEquals(-1.0, CjkDecodeValidator.strippedFailureRate(b, Charset.forName("GB18030")));
+ }
+
+ @Test
+ public void appliesToLegacyCjkButNotIso2022OrLatin() {
+ assertTrue(CjkDecodeValidator.appliesTo("GB18030"));
+ assertTrue(CjkDecodeValidator.appliesTo("Shift_JIS"));
+ assertTrue(CjkDecodeValidator.appliesTo("Big5-HKSCS"));
+ assertTrue(CjkDecodeValidator.appliesTo("x-windows-949"));
+ assertEquals(false, CjkDecodeValidator.appliesTo("ISO-2022-JP"));
+ assertEquals(false, CjkDecodeValidator.appliesTo("windows-1252"));
+ }
+}
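
Editor's sketch (not part of the commit): the tests above compare a decode-failure rate against a 0.025 veto boundary. The block below illustrates the general idea of measuring decode failures by counting U+FFFD after a lenient decode. It is an assumption-laden stand-in: the class and method names are hypothetical, and it does not reproduce CjkDecodeValidator.strippedFailureRate, whose stripping step and denominator are not shown in this part of the diff.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class DecodeFailureSketch {

    /**
     * Fraction of decoded chars that are U+FFFD, or -1.0 for empty input.
     * Hypothetical helper; not the CjkDecodeValidator algorithm.
     */
    static double replacementRate(byte[] bytes, Charset cs) {
        // new String(byte[], Charset) substitutes U+FFFD for malformed input.
        String decoded = new String(bytes, cs);
        if (decoded.isEmpty()) {
            return -1.0;
        }
        int bad = 0;
        for (int i = 0; i < decoded.length(); i++) {
            if (decoded.charAt(i) == '\uFFFD') {
                bad++;
            }
        }
        return (double) bad / decoded.length();
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        // Seven Japanese characters written with escapes to stay encoding-safe.
        byte[] real = "\u65E5\u672C\u8A9E\u306E\u30C6\u30B9\u30C8".getBytes(sjis);
        System.out.println("real text:  " + replacementRate(real, sjis));

        byte[] garbage = new byte[60];
        Arrays.fill(garbage, (byte) 0xFF); // 0xFF never occurs in well-formed UTF-8
        System.out.println("0xFF bytes: " + replacementRate(garbage, StandardCharsets.UTF_8));
    }
}
```

Round-tripped text should score zero while all-garbage input scores high; a threshold such as the 0.025 used in the tests sits between the two.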
diff --git a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/test/java/org/apache/tika/ml/chardetect/Iso2022DetectionTest.java b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/test/java/org/apache/tika/ml/chardetect/Iso2022DetectionTest.java
new file mode 100644
index 0000000000..30b0332a01
--- /dev/null
+++ b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/test/java/org/apache/tika/ml/chardetect/Iso2022DetectionTest.java
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.ml.chardetect;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNotEquals;
+
+import java.io.ByteArrayOutputStream;
+import java.nio.charset.Charset;
+import java.nio.charset.StandardCharsets;
+
+import org.junit.jupiter.api.Test;
+
+import org.apache.tika.detect.EncodingResult;
+
+/**
+ * WS3: ISO-2022-JP/KR/CN are 7-bit escape-based encodings invisible to the NB
+ * bigram model; the detector recognizes them structurally inside the pure-ASCII
+ * branch. Binary FPs (high-byte) never reach that branch, and a stray {@code
+ * ESC $} in real ASCII is rejected by the decode-verify.
+ */
+public class Iso2022DetectionTest {
+
+ private final MojibusterEncodingDetector det = newDetector();
+
+ private static MojibusterEncodingDetector newDetector() {
+ try {
+ return new MojibusterEncodingDetector();
+ } catch (Exception e) {
+ throw new RuntimeException(e);
+ }
+ }
+
+ @Test
+ public void detectsRealIso2022Jp() throws Exception {
+ byte[] b = ("日本語のテスト文章です。これは ISO-2022-JP でエンコードされた"
+ + "純粋に7ビットの文書です。").getBytes("ISO-2022-JP");
+ EncodingResult top = det.detect(b).get(0);
+ assertEquals("ISO-2022-JP", top.getCharset().name());
+ assertEquals(EncodingResult.ResultType.STRUCTURAL, top.getResultType());
+ }
+
+ @Test
+ public void detectsRealIso2022Kr() throws Exception {
+ byte[] b = ("안녕하세요 이것은 ISO-2022-KR 로 인코딩된 한국어 문서입니다 "
+ + "순수한 7비트 텍스트입니다").getBytes("ISO-2022-KR");
+ assertEquals("ISO-2022-KR", det.detect(b).get(0).getCharset().name());
+ }
+
+ @Test
+ public void plainAsciiIsNotIso2022() {
+ byte[] b = "Hello world, this is ordinary 7-bit ASCII prose with no escapes."
+ .getBytes(StandardCharsets.US_ASCII);
+ Charset top = det.detect(b).get(0).getCharset();
+ assertNotEquals("ISO-2022-JP", top.name());
+ assertNotEquals("ISO-2022-KR", top.name());
+ }
+
+ /** A real {@code ESC(0x1B) $ B} with an empty JIS section embedded in ASCII
+ * yields zero CJK, so the decode-verify must reject it (not crown ISO-2022-JP). */
+ @Test
+ public void strayEscapeInAsciiIsNotIso2022() {
+ ByteArrayOutputStream bo = new ByteArrayOutputStream();
+ bo.writeBytes("terminal log dump: ".getBytes(StandardCharsets.US_ASCII));
+ bo.writeBytes(new byte[] {0x1b, '$', 'B', 0x1b, '(', 'B'}); // enter then exit JIS
+ bo.writeBytes("back to ascii output".getBytes(StandardCharsets.US_ASCII));
+ assertNotEquals("ISO-2022-JP", det.detect(bo.toByteArray()).get(0).getCharset().name());
+ }
+}
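
Editor's sketch (not part of the commit): the Javadoc above describes a structural check for ISO-2022 followed by a decode-verify step. One way this could look for ISO-2022-JP is sketched below. The class name, the method names and the "at least one Han/Hiragana/Katakana code point" acceptance rule are all assumptions made for illustration; MojibusterEncodingDetector's actual logic is not shown in this part of the diff.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Iso2022JpSketch {

    private static final Charset ISO_2022_JP = Charset.forName("ISO-2022-JP");

    /** Hypothetical check: 7-bit input, an ESC $ designator, and real Japanese after decoding. */
    static boolean looksLikeIso2022Jp(byte[] bytes) {
        boolean sawDesignator = false;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] < 0) {
                return false; // a high byte means the input is not 7-bit
            }
            if (bytes[i] == 0x1B && i + 1 < bytes.length && bytes[i + 1] == '$') {
                sawDesignator = true;
            }
        }
        if (!sawDesignator) {
            return false;
        }
        // Decode-verify: a stray escape with an empty JIS section yields no Japanese.
        String decoded = new String(bytes, ISO_2022_JP);
        return decoded.codePoints().anyMatch(Iso2022JpSketch::isJapanese);
    }

    private static boolean isJapanese(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HAN
                || script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA;
    }

    public static void main(String[] args) {
        byte[] real = "\u65E5\u672C\u8A9E".getBytes(ISO_2022_JP);
        byte[] ascii = "plain ascii".getBytes(StandardCharsets.US_ASCII);
        byte[] stray = {'l', 'o', 'g', 0x1B, '$', 'B', 0x1B, '(', 'B', 'o', 'k'};
        System.out.println(looksLikeIso2022Jp(real));
        System.out.println(looksLikeIso2022Jp(ascii));
        System.out.println(looksLikeIso2022Jp(stray));
    }
}
```

The three inputs in main mirror the three test shapes above: a genuine ISO-2022-JP document, plain ASCII, and ASCII with an enter-then-exit JIS escape pair.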
diff --git a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparer.java b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparer.java
index b4679c1845..06e5486cb4 100644
--- a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparer.java
+++ b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparer.java
@@ -68,6 +68,8 @@ public class ExtractComparer extends ProfilerBase {
public static TableInfo EMBEDDED_FILE_PATH_TABLE_B = new TableInfo("emb_path_b", ExtractProfiler.EMBEDDED_FILE_PATH_TABLE.getColInfos());
public static TableInfo CONTENTS_TABLE_A = new TableInfo("contents_a", ExtractProfiler.CONTENTS_TABLE.getColInfos());
public static TableInfo CONTENTS_TABLE_B = new TableInfo("contents_b", ExtractProfiler.CONTENTS_TABLE.getColInfos());
+ public static TableInfo ENCODINGS_TABLE_A = new TableInfo("encodings_a", ExtractProfiler.ENCODINGS_TABLE.getColInfos());
+ public static TableInfo ENCODINGS_TABLE_B = new TableInfo("encodings_b", ExtractProfiler.ENCODINGS_TABLE.getColInfos());
public static TableInfo TAGS_TABLE_A = new TableInfo("tags_a", ExtractProfiler.TAGS_TABLE.getColInfos());
public static TableInfo TAGS_TABLE_B = new TableInfo("tags_b", ExtractProfiler.TAGS_TABLE.getColInfos());
public static TableInfo EXCEPTION_TABLE_A = new TableInfo("exceptions_a", ExtractProfiler.EXCEPTION_TABLE.getColInfos());
@@ -207,6 +209,7 @@ public class ExtractComparer extends ProfilerBase {
writeTagData(fileId, contentTagsA, TAGS_TABLE_A);
writeProfileData(fpsA, i, contentTagsA, metadataA, fileId, containerID, numAttachmentsA, PROFILES_A);
+ writeEncodingData(fileId, metadataA, ENCODINGS_TABLE_A);
writeExceptionData(fileId, metadataA, EXCEPTION_TABLE_A);
int matchIndex = getMatch(i, sharedDigestKey, emptyDigest, handledB, metadataListA, metadataListB);
@@ -218,6 +221,7 @@ public class ExtractComparer extends ProfilerBase {
contentTagsB = getContent(fpsB, metadataB);
writeTagData(fileId, contentTagsB, TAGS_TABLE_B);
writeProfileData(fpsB, i, contentTagsB, metadataB, fileId, containerID, numAttachmentsB, PROFILES_B);
+ writeEncodingData(fileId, metadataB, ENCODINGS_TABLE_B);
writeExceptionData(fileId, metadataB, EXCEPTION_TABLE_B);
}
writeEmbeddedFilePathData(i, fileId, metadataA, metadataB);
@@ -263,6 +267,7 @@ public class ExtractComparer extends ProfilerBase {
String fileId = (i == 0) ? containerID : Integer.toString(ID.getAndIncrement());
writeTagData(fileId, contentTagsB, TAGS_TABLE_B);
writeProfileData(fpsB, i, contentTagsB, metadataB, fileId, containerID, numAttachmentsB, PROFILES_B);
+ writeEncodingData(fileId, metadataB, ENCODINGS_TABLE_B);
writeEmbeddedFilePathData(i, fileId, null, metadataB);
writeExceptionData(fileId, metadataB, EXCEPTION_TABLE_B);
diff --git a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparerRunner.java b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparerRunner.java
index 3535574233..3ffcc5a31d 100644
--- a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparerRunner.java
+++ b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractComparerRunner.java
@@ -343,6 +343,7 @@ public class ExtractComparerRunner {
tableInfosA.add(ExtractComparer.EXCEPTION_TABLE_A);
tableInfosA.add(ExtractComparer.TAGS_TABLE_A);
tableInfosA.add(ExtractComparer.CONTENTS_TABLE_A);
+ tableInfosA.add(ExtractComparer.ENCODINGS_TABLE_A);
tableInfosA.add(ExtractComparer.EXTRACT_EXCEPTION_TABLE_A);
tableInfosA.add(ExtractComparer.EMBEDDED_FILE_PATH_TABLE_A);
@@ -351,6 +352,7 @@ public class ExtractComparerRunner {
tableInfosB.add(ExtractComparer.EXTRACT_EXCEPTION_TABLE_B);
tableInfosB.add(ExtractComparer.TAGS_TABLE_B);
tableInfosB.add(ExtractComparer.CONTENTS_TABLE_B);
+ tableInfosB.add(ExtractComparer.ENCODINGS_TABLE_B);
tableInfosB.add(ExtractComparer.EMBEDDED_FILE_PATH_TABLE_B);
tableInfosAandB.add(ExtractComparer.COMPARISON_CONTAINERS);
diff --git a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractProfileRunner.java b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractProfileRunner.java
index b7acb7c684..fca4c61473 100644
--- a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractProfileRunner.java
+++ b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractProfileRunner.java
@@ -256,6 +256,7 @@ public class ExtractProfileRunner {
tableInfos.add(ExtractProfiler.EXTRACT_EXCEPTION_TABLE);
tableInfos.add(ExtractProfiler.EXCEPTION_TABLE);
tableInfos.add(ExtractProfiler.CONTENTS_TABLE);
+ tableInfos.add(ExtractProfiler.ENCODINGS_TABLE);
tableInfos.add(ExtractProfiler.TAGS_TABLE);
tableInfos.add(ExtractProfiler.EMBEDDED_FILE_PATH_TABLE);
this.tableInfos = Collections.unmodifiableList(tableInfos);
diff --git a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractProfiler.java b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractProfiler.java
index 0073f9ddbb..551f92d11a 100644
--- a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractProfiler.java
+++ b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ExtractProfiler.java
@@ -54,9 +54,17 @@ public class ExtractProfiler extends ProfilerBase {
new ColInfo(Cols.ATTACHMENT_TYPE, Types.VARCHAR, 32), new ColInfo(Cols.FILE_EXTENSION, Types.VARCHAR, 12), new ColInfo(Cols.MIME_ID, Types.INTEGER),
new ColInfo(Cols.ELAPSED_TIME_MILLIS, Types.INTEGER), new ColInfo(Cols.NUM_ATTACHMENTS, Types.INTEGER), new ColInfo(Cols.NUM_METADATA_VALUES, Types.INTEGER),
new ColInfo(Cols.NUM_PAGES, Types.INTEGER), new ColInfo(Cols.NUM_OCR_PAGES, Types.INTEGER), new ColInfo(Cols.HAS_CONTENT, Types.BOOLEAN));
+ /** Charset detection per file (one row only when detection ran): final pick,
+ * winning detector, declared charset from metadata (Content-Type-Hint). */
+ public static TableInfo ENCODINGS_TABLE = new TableInfo("encodings",
+ new ColInfo(Cols.ID, Types.INTEGER, "PRIMARY KEY"),
+ new ColInfo(Cols.DETECTED_ENCODING, Types.VARCHAR, 64),
+ new ColInfo(Cols.ENCODING_DETECTOR, Types.VARCHAR, 64),
+ new ColInfo(Cols.DECLARED_METADATA, Types.VARCHAR, 128));
public static TableInfo EMBEDDED_FILE_PATH_TABLE =
new TableInfo("emb_file_names", new ColInfo(Cols.ID, Types.INTEGER, "PRIMARY KEY"), new ColInfo(Cols.EMBEDDED_FILE_PATH, Types.VARCHAR, 1024));
public static TableInfo CONTENTS_TABLE = new TableInfo("contents", new ColInfo(Cols.ID, Types.INTEGER, "PRIMARY KEY"), new ColInfo(Cols.CONTENT_LENGTH, Types.INTEGER),
+ new ColInfo(Cols.NUM_REPLACEMENT, Types.INTEGER), new ColInfo(Cols.NUM_NON_ASCII, Types.INTEGER),
new ColInfo(Cols.NUM_UNIQUE_TOKENS, Types.INTEGER), new ColInfo(Cols.NUM_TOKENS, Types.INTEGER), new ColInfo(Cols.COMMON_TOKENS_LANG, Types.VARCHAR, 12),
new ColInfo(Cols.NUM_UNIQUE_COMMON_TOKENS, Types.INTEGER), new ColInfo(Cols.NUM_COMMON_TOKENS, Types.INTEGER),
new ColInfo(Cols.NUM_UNIQUE_ALPHABETIC_TOKENS, Types.INTEGER), new ColInfo(Cols.NUM_ALPHABETIC_TOKENS, Types.INTEGER), new ColInfo(Cols.OOV, Types.DOUBLE),
@@ -146,6 +154,7 @@ public class ExtractProfiler extends ProfilerBase {
String fileId = (i == 0) ? containerIdString : Integer.toString(ID.incrementAndGet());
writeTagData(fileId, contentTags, TAGS_TABLE);
writeProfileData(fps, i, contentTags, m, fileId, containerIdString, numAttachments, PROFILE_TABLE);
+ writeEncodingData(fileId, m, ENCODINGS_TABLE);
writeEmbeddedPathData(i, fileId, m, EMBEDDED_FILE_PATH_TABLE);
writeExceptionData(fileId, m, EXCEPTION_TABLE);
try {
diff --git a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ProfilerBase.java b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ProfilerBase.java
index 9c273b80db..6942c3b5a8 100644
--- a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ProfilerBase.java
+++ b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/ProfilerBase.java
@@ -51,6 +51,8 @@ import org.apache.tika.eval.core.textstats.BasicTokenCountStatsCalculator;
import org.apache.tika.eval.core.textstats.CommonTokens;
import org.apache.tika.eval.core.textstats.CompositeTextStatsCalculator;
import org.apache.tika.eval.core.textstats.ContentLengthCalculator;
+import org.apache.tika.eval.core.textstats.NonAsciiCharCounter;
+import org.apache.tika.eval.core.textstats.ReplacementCharCounter;
import org.apache.tika.eval.core.textstats.TextStatsCalculator;
import org.apache.tika.eval.core.textstats.TokenEntropy;
import org.apache.tika.eval.core.textstats.TokenLengths;
@@ -324,6 +326,8 @@ public abstract class ProfilerBase {
calculators.add(new TopNTokens(10));
calculators.add(new BasicTokenCountStatsCalculator());
calculators.add(new ContentLengthCalculator());
+ calculators.add(new ReplacementCharCounter());
+ calculators.add(new NonAsciiCharCounter());
calculators.add(new UnicodeBlockCounter(maxContentLengthForLangId));
return new CompositeTextStatsCalculator(calculators, analyzerManager, langIder);
@@ -497,6 +501,19 @@ public abstract class ProfilerBase {
return;
}
data.put(Cols.CONTENT_LENGTH, Integer.toString(length));
+ Integer numReplacement = (Integer) textStats.get(ReplacementCharCounter.class);
+ if (numReplacement != null) {
+ data.put(Cols.NUM_REPLACEMENT, Integer.toString(numReplacement));
+ }
+ // Store raw counts only; derive the FFFD rate in SQL. Decode failures
+ // come only from high bytes, so num_replacement/num_non_ascii is the
+ // un-diluted rate (num_replacement/content_length dilutes to ~0 on
+ // ASCII-dominated docs). U+FFFD is itself >= 0x80, so it is counted in
+ // num_non_ascii and both ratios stay in [0,1].
+ Integer numNonAscii = (Integer) textStats.get(NonAsciiCharCounter.class);
+ if (numNonAscii != null) {
+ data.put(Cols.NUM_NON_ASCII, Integer.toString(numNonAscii));
+ }
}
langid(textStats, data);
@@ -541,6 +558,35 @@ public abstract class ProfilerBase {
}
}
+ /**
+ * Per-file charset-detection record: the final detected encoding, the
+ * detector that won, and the declared charset from metadata
+ * (Content-Type-Hint). Writes a row only when a detected encoding is
+ * present, so the table holds only files that ran charset detection.
+ */
+ protected void writeEncodingData(String fileId, Metadata m, TableInfo encodingsTable) {
+ String detected = m.get(TikaCoreProperties.DETECTED_ENCODING);
+ if (detected == null) {
+ return;
+ }
+ Map<Cols, String> data = new HashMap<>();
+ data.put(Cols.ID, fileId);
+ data.put(Cols.DETECTED_ENCODING, detected);
+ String detector = m.get(TikaCoreProperties.ENCODING_DETECTOR);
+ if (detector != null) {
+ data.put(Cols.ENCODING_DETECTOR, detector);
+ }
+ String declared = m.get(TikaCoreProperties.CONTENT_TYPE_HINT);
+ if (declared != null) {
+ data.put(Cols.DECLARED_METADATA, declared);
+ }
+ try {
+ writer.writeRow(encodingsTable, data);
+ } catch (IOException e) {
+ throw new RuntimeException(e);
+ }
+ }
+
void writeTagData(String fileId, ContentTags contentTags, TableInfo tagsTable) {
Map<String, Integer> tags = contentTags.getTags();
if (tags.size() == 0 && contentTags.getParseException() == false) {
diff --git a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/db/Cols.java b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/db/Cols.java
index 0724f7f16e..6aa5f22259 100644
--- a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/db/Cols.java
+++ b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/db/Cols.java
@@ -26,9 +26,13 @@ public enum Cols {
//profile table
ID, LENGTH, FILE_NAME, FILE_EXTENSION, ELAPSED_TIME_MILLIS,
NUM_METADATA_VALUES, IS_EMBEDDED, EMBEDDED_FILE_PATH, MIME_ID, TIKA_MIME_ID,
FILE_MIME_ID, SHA256, MD5,
NUM_ATTACHMENTS, ATTACHMENT_TYPE, EMBEDDED_DEPTH, HAS_CONTENT,
+ //charset detection (encodings table): final pick, winning detector, declared-via-metadata (Content-Type-Hint)
+ DETECTED_ENCODING, ENCODING_DETECTOR, DECLARED_METADATA,
//content
- CONTENT_LENGTH, NUM_UNIQUE_TOKENS, NUM_TOKENS, NUM_UNIQUE_ALPHABETIC_TOKENS, NUM_ALPHABETIC_TOKENS, //alphabetic or ideographic tokens
+ CONTENT_LENGTH, NUM_REPLACEMENT, NUM_NON_ASCII, //U+FFFD + non-ASCII (>=0x80) counts; FFFD rate via SQL: num_replacement/num_non_ascii
+
+ NUM_UNIQUE_TOKENS, NUM_TOKENS, NUM_UNIQUE_ALPHABETIC_TOKENS, NUM_ALPHABETIC_TOKENS, //alphabetic or ideographic tokens
COMMON_TOKENS_LANG, //which language was used for the common tokens metric?
NUM_UNIQUE_COMMON_TOKENS, NUM_COMMON_TOKENS, TOP_N_TOKENS, LANG_ID_1,
LANG_ID_PROB_1, LANG_ID_2,
OOV, LANGUAGENESS, LANG_ID_PROB_2, TOKEN_ENTROPY_RATE, TOKEN_LENGTH_SUM,
TOKEN_LENGTH_MEAN,
diff --git a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/reports/MarkdownSummaryWriter.java b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/reports/MarkdownSummaryWriter.java
index dfc2f5f8cb..75bc7d4d4b 100644
--- a/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/reports/MarkdownSummaryWriter.java
+++ b/tika-eval/tika-eval-app/src/main/java/org/apache/tika/eval/app/reports/MarkdownSummaryWriter.java
@@ -478,7 +478,9 @@ public class MarkdownSummaryWriter {
"round(avg(cb.oov) - avg(ca.oov), 4) as OOV_DELTA, " +
"round(avg(ca.languageness), 2) as MEAN_LANG_A, " +
"round(avg(cb.languageness), 2) as MEAN_LANG_B, " +
- "round(avg(cb.languageness) - avg(ca.languageness), 2) as LANG_DELTA " +
+ "round(avg(cb.languageness) - avg(ca.languageness), 2) as LANG_DELTA, " +
+ "round(avg(ca.num_replacement), 1) as MEAN_FFFD_A, " +
+ "round(avg(cb.num_replacement), 1) as MEAN_FFFD_B " +
"from contents_a ca " +
"join contents_b cb on ca.id = cb.id " +
"join profiles_a pa on ca.id = pa.id " +
@@ -498,6 +500,8 @@ public class MarkdownSummaryWriter {
"round(cb.oov - ca.oov, 4) as OOV_DELTA, " +
"round(ca.languageness, 2) as LANG_A, " +
"round(cb.languageness, 2) as LANG_B, " +
+ "ca.num_replacement as FFFD_A, " +
+ "cb.num_replacement as FFFD_B, " +
"ca.lang_id_1 as LANG_ID_A, " +
"cb.lang_id_1 as LANG_ID_B " +
"from contents_a ca " +
@@ -523,6 +527,8 @@ public class MarkdownSummaryWriter {
"round(cb.languageness - ca.languageness, 2) as LANG_DELTA, " +
"round(ca.oov, 4) as OOV_A, " +
"round(cb.oov, 4) as OOV_B, " +
+ "ca.num_replacement as FFFD_A, " +
+ "cb.num_replacement as FFFD_B, " +
"ca.lang_id_1 as LANG_ID_A, " +
"cb.lang_id_1 as LANG_ID_B " +
"from contents_a ca " +
diff --git a/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/langid/LanguageIDWrapper.java b/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/langid/LanguageIDWrapper.java
index c17e6b01f2..2b2e995cf7 100644
--- a/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/langid/LanguageIDWrapper.java
+++ b/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/langid/LanguageIDWrapper.java
@@ -38,10 +38,24 @@ public class LanguageIDWrapper implements StringStatsCalculator<List<LanguageRes
public List<LanguageResult> calculate(String txt) {
CharSoupLanguageDetector detector = new CharSoupLanguageDetector();
detector.setMaxLength(MAX_TEXT_LENGTH);
- detector.addText(txt);
+ detector.addText(normalizeWhitespace(txt));
return detector.detectAll();
}
+ /**
+ * Collapse whitespace runs and trim before langid: the truncation window
+ * counts whitespace, so extracts differing only in whitespace can flip the
+ * detected language and pick different common-token dictionaries in an A/B
+ * eval. CharSoup features are whitespace-invariant, so this only stabilizes
+ * the window, not the scoring.
+ */
+ static String normalizeWhitespace(String txt) {
+ if (txt == null) {
+ return "";
+ }
+ return txt.replaceAll("\\s+", " ").trim();
+ }
+
public static Set<String> getSupportedLanguages() {
return CharSoupLanguageDetector.getSupportedLanguages();
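
Editor's sketch (not part of the commit): the Javadoc on normalizeWhitespace argues that a length-limited window counts whitespace, so two extracts that differ only in whitespace can expose different text to the detector. The block below illustrates that effect in isolation. The window helper and the limit of 16 are hypothetical stand-ins for the detector's max-length truncation, which is not shown in this diff; normalizeWhitespace is copied from the patch.

```java
final class WhitespaceWindowSketch {

    /** Same normalization as the patch: collapse whitespace runs, then trim. */
    static String normalizeWhitespace(String txt) {
        if (txt == null) {
            return "";
        }
        return txt.replaceAll("\\s+", " ").trim();
    }

    /** Hypothetical stand-in for a detector's max-length truncation. */
    static String window(String txt, int maxLength) {
        return txt.length() <= maxLength ? txt : txt.substring(0, maxLength);
    }

    public static void main(String[] args) {
        String extractA = "hello   world\n\n\nbonjour le monde";
        String extractB = "hello world bonjour le monde";

        // Raw windows differ: extract A spends its budget on whitespace.
        System.out.println("[" + window(extractA, 16) + "]");
        System.out.println("[" + window(extractB, 16) + "]");

        // Normalized windows are identical, so both runs see the same text.
        System.out.println("[" + window(normalizeWhitespace(extractA), 16) + "]");
        System.out.println("[" + window(normalizeWhitespace(extractB), 16) + "]");
    }
}
```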
diff --git a/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/textstats/NonAsciiCharCounter.java b/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/textstats/NonAsciiCharCounter.java
new file mode 100644
index 0000000000..7cd87dd810
--- /dev/null
+++ b/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/textstats/NonAsciiCharCounter.java
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.eval.core.textstats;
+
+/**
+ * Counts non-ASCII characters (code units ≥ U+0080) in the extracted text.
+ *
+ * <p>Used as the denominator for the U+FFFD rate (see {@link ReplacementCharCounter}):
+ * decode failures only arise from high bytes, so FFFD as a fraction of non-ASCII
+ * chars is the un-diluted signal, whereas FFFD over total length collapses to ~0
+ * on COMMON / ASCII-dominated documents. U+FFFD itself is ≥ 0x80, so it is
+ * included in this count, keeping the rate in [0, 1].</p>
+ */
+public class NonAsciiCharCounter implements StringStatsCalculator<Integer> {
+ @Override
+ public Integer calculate(String txt) {
+ int n = 0;
+ for (int i = 0; i < txt.length(); i++) {
+ if (txt.charAt(i) >= 0x80) {
+ n++;
+ }
+ }
+ return n;
+ }
+}
diff --git a/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/textstats/ReplacementCharCounter.java b/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/textstats/ReplacementCharCounter.java
new file mode 100644
index 0000000000..a55dbdc65c
--- /dev/null
+++ b/tika-eval/tika-eval-core/src/main/java/org/apache/tika/eval/core/textstats/ReplacementCharCounter.java
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.eval.core.textstats;
+
+/**
+ * Counts U+FFFD (REPLACEMENT CHARACTER) occurrences in the extracted text.
+ *
+ * <p>A high replacement-char count signals a decode failure — the charset used
+ * to decode the bytes couldn't map them, producing U+FFFD. Unlike OOV, this is
+ * a structural correctness signal that does not depend on the per-language
+ * vocabulary, so it does not mis-rank CJK decodes (real CJK is OOV-heavy but
+ * has zero U+FFFD; mojibake under the wrong charset has many).</p>
+ */
+public class ReplacementCharCounter implements StringStatsCalculator<Integer> {
+ @Override
+ public Integer calculate(String txt) {
+ int n = 0;
+ for (int i = 0; i < txt.length(); i++) {
+ if (txt.charAt(i) == 0xFFFD) {
+ n++;
+ }
+ }
+ return n;
+ }
+}
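
Editor's sketch (not part of the commit): the two counters above store raw counts, and the rate is meant to be derived later as num_replacement / num_non_ascii. The block below restates the two counting loops from the patch and compares that ratio with the diluted num_replacement / content_length on a markup-heavy string. The class name, the rate helper and its zero-denominator convention are assumptions for illustration; the patch itself leaves the division to SQL.

```java
final class FffdRateSketch {

    /** Same loop as ReplacementCharCounter: count U+FFFD code units. */
    static int countReplacement(String txt) {
        int n = 0;
        for (int i = 0; i < txt.length(); i++) {
            if (txt.charAt(i) == 0xFFFD) {
                n++;
            }
        }
        return n;
    }

    /** Same loop as NonAsciiCharCounter: count code units at or above 0x80. */
    static int countNonAscii(String txt) {
        int n = 0;
        for (int i = 0; i < txt.length(); i++) {
            if (txt.charAt(i) >= 0x80) {
                n++;
            }
        }
        return n;
    }

    /** Hypothetical helper: ratio with 0.0 for an empty denominator. */
    static double rate(int numerator, int denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }

    public static void main(String[] args) {
        // Mostly ASCII markup, one decode failure, two correctly decoded CJK chars.
        String txt = "<html><body>hello world</body></html> caf\uFFFD \u65E5\u672C";
        int fffd = countReplacement(txt);
        int nonAscii = countNonAscii(txt);
        System.out.println("over non-ASCII: " + rate(fffd, nonAscii));
        System.out.println("over length:    " + rate(fffd, txt.length()));
    }
}
```

Because U+FFFD is itself counted as non-ASCII, the first ratio cannot exceed 1, and it stays informative on documents where ASCII dominates the total length.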
diff --git a/tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetector.java b/tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetector.java
index 056c768a65..a7706e2118 100644
--- a/tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetector.java
+++ b/tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetector.java
@@ -18,9 +18,13 @@ package org.apache.tika.ml.junkdetect;
import java.io.IOException;
import java.nio.charset.Charset;
+import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
import java.util.LinkedHashMap;
+import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
@@ -29,8 +33,10 @@ import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.apache.tika.config.TikaComponent;
+import org.apache.tika.detect.CharsetSupersets;
import org.apache.tika.detect.EncodingDetectorContext;
import org.apache.tika.detect.EncodingResult;
+import org.apache.tika.detect.HighByteLetterStats;
import org.apache.tika.detect.MetaEncodingDetector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
@@ -80,6 +86,17 @@ public class JunkFilterEncodingDetector implements MetaEncodingDetector {
* anchor instead of arbitrating near-identical decodes by quality. */
private static final float NO_INFO_CONFIDENCE = 0.1f;
+ // Adaptive candidate band (TIKA speed lever). The tournament only needs
+ // NB's top-2 statistical candidates plus any lower-ranked candidate whose
+ // confidence is still at least MIN_TAIL_CONFIDENCE (an absolute floor, not
+ // a band relative to the top); deeper, low-confidence candidates are clearly
+ // dominated and almost never win (measured: a 0.5 floor retains ~98-99% of
+ // selected winners, ~20% smaller pool). Anchors (DECLARATIVE, STRUCTURAL)
+ // are always kept regardless of confidence. Quality impact is validated by
+ // a full common-token/OOV eval, NOT assumed.
+ private static final int ALWAYS_KEEP_TOP_N = 2;
+ private static final float MIN_TAIL_CONFIDENCE = 0.5f;
+
/** Cached quality detector. {@code null} if none is on the classpath. */
private final TextQualityDetector qualityDetector;
@@ -152,24 +169,25 @@ public class JunkFilterEncodingDetector implements MetaEncodingDetector {
// become codepoints whose cross-script transitions expose mojibake
// under a wrong decoding (AIT5 case).
Map<Charset, String> candidates = new LinkedHashMap<>();
- for (Charset cs : uniqueCharsets) {
- String decoded = safeDecode(bytes, cs);
- if (decoded != null && !decoded.isEmpty()) {
- decoded = HtmlContentCleaner.clean(decoded);
+ // Dedup: charsets that decode the raw probe to the identical string
+ // (e.g. GB18030/GBK, x-windows-949/EUC-KR on non-extension content)
+ // share one clean() call — the cleaned result is identical by
+ // construction, so this is quality-neutral, purely a work saving.
+ Map<String, String> cleanedByRaw = new HashMap<>();
+ Set<Charset> candidateCharsets = bandFilter(context, uniqueCharsets);
+ for (Charset cs : candidateCharsets) {
+ String raw = safeDecode(bytes, cs);
+ if (raw == null || raw.isEmpty()) {
+ LOG.trace("junk-filter decode {} -> null/empty", cs.name());
+ continue;
+ }
+ String decoded = cleanedByRaw.get(raw);
+ if (decoded == null) {
+ decoded = HtmlContentCleaner.clean(raw);
+ cleanedByRaw.put(raw, decoded);
}
if (decoded != null && !decoded.isEmpty()) {
candidates.put(cs, decoded);
- if (LOG.isTraceEnabled()) {
- int sampleLen = Math.min(400, decoded.length());
- String sample = decoded.substring(0, sampleLen)
- .replace('\n', ' ').replace('\r', ' ');
- LOG.trace("junk-filter decoded {}: '{}{}' (len={})",
- cs.name(), sample,
- decoded.length() > sampleLen ? "…" : "",
- decoded.length());
- }
- } else {
- LOG.trace("junk-filter decode {} -> null/empty", cs.name());
}
}
if (candidates.size() <= 1) {
@@ -228,17 +246,31 @@ public class JunkFilterEncodingDetector implements MetaEncodingDetector {
Charset champion = null;
double championZ = Double.NEGATIVE_INFINITY;
Map<Charset, Double> scoreByCharset = new LinkedHashMap<>();
+ Map<Charset, Double> diffByCharset = new LinkedHashMap<>();
+ // Dedup by text: [0] = whole-text z (the champion + anchor metric, kept
+ // exactly as before); [1] = script-letter "diff" z (codepoints >= 0x80
+ // that are letters/ideographs — the high bytes where the candidate
+ // decodes actually differ), used ONLY for the family gate below.
+ Map<String, float[]> zByText = new HashMap<>();
for (Map.Entry<Charset, String> entry : candidates.entrySet()) {
- org.apache.tika.quality.TextQualityScore sc =
- qualityDetector.score(entry.getValue());
- float rawZ = sc.isUnknown() ? Float.NEGATIVE_INFINITY : sc.getZScore();
- scoreByCharset.put(entry.getKey(), (double) rawZ);
- LOG.trace("junk-filter score {} z={} script={}",
- entry.getKey().name(),
- String.format(java.util.Locale.ROOT, "%.3f", rawZ),
- sc.isUnknown() ? "UNKNOWN" : sc.getDominantScript());
- if (rawZ > championZ) {
- championZ = rawZ;
+ String text = entry.getValue();
+ float[] zs = zByText.get(text);
+ if (zs == null) {
+ org.apache.tika.quality.TextQualityScore sc = qualityDetector.score(text);
+ float wholeZ = sc.isUnknown() ? Float.NEGATIVE_INFINITY : sc.getZScore();
+ String diff = scriptLetters(text);
+ float diffZ = Float.NEGATIVE_INFINITY;
+ if (!diff.isEmpty()) {
+ org.apache.tika.quality.TextQualityScore d = qualityDetector.score(diff);
+ diffZ = d.isUnknown() ? Float.NEGATIVE_INFINITY : d.getZScore();
+ }
+ zs = new float[]{wholeZ, diffZ};
+ zByText.put(text, zs);
+ }
+ scoreByCharset.put(entry.getKey(), (double) zs[0]);
+ diffByCharset.put(entry.getKey(), (double) zs[1]);
+ if (zs[0] > championZ) {
+ championZ = zs[0];
champion = entry.getKey();
}
}
@@ -248,6 +280,48 @@ public class JunkFilterEncodingDetector implements MetaEncodingDetector {
champion = candidates.keySet().iterator().next();
}
+ // CJK-vs-non-CJK family gate. The whole-text z coin-flips on the
+ // CJK/non-CJK BOUNDARY for COMMON-dominated docs (markup/digits/punct
+ // decode identically and swamp the few discriminating high bytes),
+ // producing false-CJK and real-CJK demotion. The script-letter "diff" z
+ // reads that boundary cleanly (coherent CJK vs garbage), so use it to
+ // decide ONLY the family; within a family the whole-text champion stands
+ // (Latin-vs-Latin etc. untouched — a blanket diff-score regressed there).
+ // Override only on a clear diff margin.
+ double bestCjkDiff = Double.NEGATIVE_INFINITY;
+ double bestNonCjkDiff = Double.NEGATIVE_INFINITY;
+ for (Map.Entry<Charset, Double> e : diffByCharset.entrySet()) {
+ if (isCjkCharset(e.getKey().name())) {
+ bestCjkDiff = Math.max(bestCjkDiff, e.getValue());
+ } else {
+ bestNonCjkDiff = Math.max(bestNonCjkDiff, e.getValue());
+ }
+ }
+ // DEMOTE-ONLY: fire only to demote a CJK champion to non-CJK when the
+ // diff z clearly prefers non-CJK (the false-CJK fix). The reverse
+ // (promote non-CJK -> CJK) is NOT done: measured at 29k, the diff z
+ // reliably says "this CJK pick is really non-CJK" (OOV improves on every
+ // such flip) but UNreliably says "this non-CJK pick is really CJK" (the
+ // junk model over-rates ideograph mojibake vs sparse Latin letters — OOV
+ // worsened on every promote flip). The promote direction is also
+ // unnecessary: genuine CJK is html-meta-declared upstream.
+ if (isCjkCharset(champion.name())
+ && bestNonCjkDiff > bestCjkDiff + FAMILY_DIFF_MARGIN) {
+ Charset reFam = bestInFamily(scoreByCharset, false);
+ if (reFam != null) {
+ LOG.trace("junk-filter family gate: {} (CJK) -> {} (non-CJK by diff z)",
+ champion.name(), reFam.name());
+ champion = reFam;
+ }
+ }
+
+ // Within-Latin letter gate (demote-only). Sibling to the CJK gate,
+ // for the other boundary the whole-text z can't see: a DOS-OEM / Mac
+ // pick whose high bytes decode to box-drawing/symbols beating the
+ // windows-1252 truth under COMMON-dilution. Cased-letter count reads
+ // it where typicality ties. See {@link #applyLatinLetterGate}.
+ champion = applyLatinLetterGate(bytes, champion, candidates.keySet());
+
// "No-info" guard: if the statistical layer produced no confident
// answer — no STRUCTURAL proof, and its best STATISTICAL candidate is
// no better than Mojibuster's windows-1252 "I don't know" fallback
@@ -274,6 +348,154 @@ public class JunkFilterEncodingDetector implements MetaEncodingDetector {
return List.of(new EncodingResult(champion, confidence));
}
+ /** Minimum diff-z margin by which the other family must beat the champion's
+ * family before the family gate overrides. Large enough to ignore the
+ * noise-level boundary ties; real CJK-vs-garbage diffs are far larger. */
+ private static final double FAMILY_DIFF_MARGIN = 2.0;
+
+ private static boolean isCjkCharset(String name) {
+ String n = name.toLowerCase(java.util.Locale.ROOT);
+ return n.contains("gb") || n.contains("big5") || n.contains("euc")
+ || n.contains("shift") || n.contains("jis") || n.contains("2022")
+ || n.contains("949");
+ }
+
+ /** Highest whole-text-z candidate within the requested family (CJK or not). */
+ private static Charset bestInFamily(Map<Charset, Double> wholeZ, boolean cjk) {
+ Charset best = null;
+ double bz = Double.NEGATIVE_INFINITY;
+ for (Map.Entry<Charset, Double> e : wholeZ.entrySet()) {
+ if (isCjkCharset(e.getKey().name()) == cjk && e.getValue() > bz) {
+ bz = e.getValue();
+ best = e.getKey();
+ }
+ }
+ return best;
+ }
+
+ /** Script-letter "diff" content: codepoints ≥ 0x80 that are letters/
+ * ideographs — the high bytes where candidate decodes differ. Shared ASCII
+ * and non-ASCII punctuation/symbols are dropped (they dilute toward a
+ * COMMON-dominated tie). Used only for the CJK-vs-non-CJK family gate. */
+ private static String scriptLetters(String s) {
+ StringBuilder b = new StringBuilder();
+ s.codePoints().forEach(c -> {
+ if (c >= 0x80 && Character.isLetter(c)) {
+ b.appendCodePoint(c);
+ }
+ });
+ return b.toString();
+ }
+
+ /** Canonical {@code Charset.name()} of the WHATWG-default Latin fallback. */
+ private static final String WIN1252 = "windows-1252";
+
+ /** Latin single-byte charsets the within-Latin letter gate may arbitrate.
+ * EXCLUDES non-Latin SBCS (Cyrillic windows-1251 / ISO-8859-5, Greek
+ * -1253 / -7, Hebrew -1255 / -8, Arabic -1256 / -6, Thai) whose cased
+ * letters would pollute the count, and all multi-byte CJK (the family
+ * gate's territory). */
+ private static final Set<String> LATIN_SBCS = new HashSet<>(Arrays.asList(
+ "windows-1252", "windows-1250", "windows-1254", "windows-1257", "windows-1258",
+ "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-9",
+ "ISO-8859-10", "ISO-8859-13", "ISO-8859-14", "ISO-8859-15", "ISO-8859-16",
+ "IBM437", "IBM850", "IBM852", "IBM858", "IBM860", "IBM861", "IBM863", "IBM865",
+ "x-MacRoman", "x-MacCentralEurope", "x-MacRomania", "x-MacIceland"));
+
+ /** Probe must have at least this many high bytes for the gate to act —
+ * below it the letter gap is noise (most over-picks are sparse). */
+ private static final int LATIN_GATE_MIN_HIGH_BYTES = 16;
+ /** windows-1252 must win the cased-letter count by > max(FLOOR, FRACTION
+ * * highBytes). The margin lets the gate cover Central-European / DOS
+ * siblings safely — genuine CE text wins MORE letters under its true
+ * charset so the gate stays silent — without the tie-flip that forces the
+ * mojibuster Western-Latin fallback to scope itself out of those families. */
+ private static final double LATIN_GATE_MARGIN_FLOOR = 6.0;
+ private static final double LATIN_GATE_MARGIN_FRACTION = 0.20;
+
+ /**
+ * Within-Latin letter-plausibility gate (demote-only). Demotes {@code
+ * champion} to windows-1252 only when windows-1252 is a live candidate, both
+ * are Latin SBCS, the probe is high-byte-dense, and windows-1252 decodes
+ * clearly MORE cased high-byte letters than the champion — the box-drawing
+ * signature, where a wrong IBM850 / x-MacRoman decode maps high bytes to
+ * symbols. The compare is directional: a genuine Central-European / DOS doc
+ * wins MORE letters under its true charset, so the gate leaves it untouched.
+ * Latin-scoped so it never crosses the CJK boundary (the family gate above)
+ * or touches non-Latin SBCS. Returns the (possibly demoted) charset.
+ */
+ static Charset applyLatinLetterGate(byte[] probe, Charset champion,
+ Set<Charset> candidates) {
+ String name = champion.name();
+ if (WIN1252.equals(name) || !LATIN_SBCS.contains(name)) {
+ return champion;
+ }
+ Charset win = null;
+ for (Charset c : candidates) {
+ if (WIN1252.equals(c.name())) {
+ win = c;
+ break;
+ }
+ }
+ if (win == null) {
+ return champion;
+ }
+ int high = HighByteLetterStats.countHighBytes(probe);
+ if (high < LATIN_GATE_MIN_HIGH_BYTES) {
+ return champion;
+ }
+ int winLetters = HighByteLetterStats.countCasedHighByteLetters(probe, win);
+ int champLetters = HighByteLetterStats.countCasedHighByteLetters(probe, champion);
+ double margin = Math.max(LATIN_GATE_MARGIN_FLOOR, LATIN_GATE_MARGIN_FRACTION * high);
+ if (winLetters > champLetters + margin) {
+ LOG.trace("junk-filter latin gate: {} -> windows-1252 (cased high-byte "
+ + "letters {} vs {}, high={})", name, champLetters, winLetters, high);
+ return win;
+ }
+ return champion;
+ }
+
+ /**
+ * Restrict the candidate set the tournament will decode+clean+score: keep
+ * every DECLARATIVE/STRUCTURAL anchor (author intent / byte-grammar proof),
+ * plus the top {@link #ALWAYS_KEEP_TOP_N} STATISTICAL candidates by
+ * confidence, plus any deeper STATISTICAL candidate whose confidence is at
+ * least {@link #MIN_TAIL_CONFIDENCE} (an absolute floor). Drops the
+ * dominated low-confidence tail —
+ * the speed lever — without removing any anchor or NB's real contenders.
+ * Returns a subset of {@code all}, preserving its iteration order.
+ */
+ private static Set<Charset> bandFilter(EncodingDetectorContext context, Set<Charset> all) {
+ Set<Charset> anchors = new HashSet<>();
+ List<EncodingResult> stats = new ArrayList<>();
+ for (EncodingDetectorContext.Result r : context.getResults()) {
+ for (EncodingResult er : r.getEncodingResults()) {
+ EncodingResult.ResultType t = er.getResultType();
+ if (t == EncodingResult.ResultType.DECLARATIVE
+ || t == EncodingResult.ResultType.STRUCTURAL) {
+ anchors.add(er.getCharset());
+ } else if (t == EncodingResult.ResultType.STATISTICAL) {
+ stats.add(er);
+ }
+ }
+ }
+ stats.sort((a, b) -> Float.compare(b.getConfidence(), a.getConfidence()));
+ Set<Charset> keepStat = new HashSet<>();
+ for (int i = 0; i < stats.size(); i++) {
+ if (i < ALWAYS_KEEP_TOP_N
+ || stats.get(i).getConfidence() >= MIN_TAIL_CONFIDENCE) {
+ keepStat.add(stats.get(i).getCharset());
+ }
+ }
+ Set<Charset> kept = new LinkedHashSet<>();
+ for (Charset cs : all) {
+ if (anchors.contains(cs) || keepStat.contains(cs)) {
+ kept.add(cs);
+ }
+ }
+ return kept;
+ }
+
/**
* True if some detector produced a confident non-declarative signal: any
* STRUCTURAL result (byte-grammar proof), or any STATISTICAL result above
@@ -369,10 +591,17 @@ public class JunkFilterEncodingDetector implements MetaEncodingDetector {
}
private static String safeDecode(byte[] bytes, Charset charset) {
+ // Score CJK candidates on their vendor superset, not the strict base
+ // (which U+FFFDs vendor-extension chars and unfairly penalizes real
+ // CJK). AutoDetectReader re-applies the same superset for content.
+ Charset decodeAs = CharsetSupersets.supersetOf(charset);
+ if (decodeAs == null) {
+ decodeAs = charset;
+ }
try {
- return new String(bytes, charset);
+ return new String(bytes, decodeAs);
} catch (Exception e) {
- LOG.debug("Decode failed for {}: {}", charset.name(), e.toString());
+ LOG.debug("Decode failed for {}: {}", decodeAs.name(), e.toString());
return null;
}
}
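
The family gate in the hunks above reduces to a small decision rule. The following is a minimal standalone sketch of that rule, not the committed code: it keys the score maps by charset name instead of Charset, and the class name FamilyGateSketch and the method decide are hypothetical names introduced only for illustration. The margin value and the name-matching heuristic are copied from the diff.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch only; FamilyGateSketch and decide() are hypothetical names. */
final class FamilyGateSketch {
    /** Same value as FAMILY_DIFF_MARGIN in the diff. */
    static final double FAMILY_DIFF_MARGIN = 2.0;

    /** Name-based CJK test, mirroring isCjkCharset in the diff. */
    static boolean isCjk(String name) {
        String n = name.toLowerCase(Locale.ROOT);
        return n.contains("gb") || n.contains("big5") || n.contains("euc")
                || n.contains("shift") || n.contains("jis") || n.contains("2022")
                || n.contains("949");
    }

    /** Highest-scoring candidate in the requested family, or null if none. */
    static String bestInFamily(Map<String, Double> z, boolean cjk) {
        String best = null;
        double bz = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : z.entrySet()) {
            if (isCjk(e.getKey()) == cjk && e.getValue() > bz) {
                bz = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    /** Demote-only: a CJK champion may fall to non-CJK, never the reverse. */
    static String decide(String champion, Map<String, Double> wholeZ,
                         Map<String, Double> diffZ) {
        if (!isCjk(champion)) {
            return champion; // the promote direction is deliberately not implemented
        }
        double bestCjk = Double.NEGATIVE_INFINITY;
        double bestNonCjk = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : diffZ.entrySet()) {
            if (isCjk(e.getKey())) {
                bestCjk = Math.max(bestCjk, e.getValue());
            } else {
                bestNonCjk = Math.max(bestNonCjk, e.getValue());
            }
        }
        if (bestNonCjk > bestCjk + FAMILY_DIFF_MARGIN) {
            // The family is chosen by the diff z; the member by the whole-text z.
            String reFam = bestInFamily(wholeZ, false);
            if (reFam != null) {
                return reFam;
            }
        }
        return champion;
    }

    public static void main(String[] args) {
        Map<String, Double> whole = new LinkedHashMap<>();
        whole.put("GB18030", -1.0);
        whole.put("windows-1252", -1.5);
        Map<String, Double> diff = new LinkedHashMap<>();
        diff.put("GB18030", -8.0);
        diff.put("windows-1252", -1.0);
        System.out.println(decide("GB18030", whole, diff));
    }
}
```

Note that the comparison is strict, so a diff gap of exactly the margin leaves the champion in place.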
diff --git a/tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetectorTest.java b/tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetectorTest.java
index 705cbbe99f..a0904289b8 100644
--- a/tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetectorTest.java
+++ b/tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/JunkFilterEncodingDetectorTest.java
@@ -67,6 +67,83 @@ public class JunkFilterEncodingDetectorTest {
}
}
+ /**
+ * Functional stub for the CJK-vs-non-CJK family gate. Returns one of four
+ * controlled z-scores per scored string, keyed on whether the string
+ * contains Han ideographs (CJK family) and whether it is a "diff" string
+ * (script-letters only, i.e. every codepoint ≥ 0x80) vs whole text.
+ * Lets us drive {@code JunkFilterEncodingDetector}'s gate deterministically
+ * without the real model: the detector scores both the whole decoded text
+ * (champion metric) and its script-letter diff (family-gate metric) for
+ * each candidate, so the four cells fully determine the gate's decision.
+ */
+ private static final class ZStub implements TextQualityDetector {
+ private final double wholeCjk;
+ private final double wholeNonCjk;
+ private final double diffCjk;
+ private final double diffNonCjk;
+
+ ZStub(double wholeCjk, double wholeNonCjk, double diffCjk, double diffNonCjk) {
+ this.wholeCjk = wholeCjk;
+ this.wholeNonCjk = wholeNonCjk;
+ this.diffCjk = diffCjk;
+ this.diffNonCjk = diffNonCjk;
+ }
+
+ private static boolean isCjk(String s) {
+ return s.codePoints().anyMatch(c -> c >= 0x4E00 && c <= 0x9FFF);
+ }
+
+ /** Diff string = script-letters only: non-empty, every codepoint ≥ 0x80. */
+ private static boolean isDiff(String s) {
+ return !s.isEmpty() && s.codePoints().allMatch(c -> c >= 0x80);
+ }
+
+ @Override
+ public TextQualityScore score(String text) {
+ boolean cjk = isCjk(text);
+ double z = isDiff(text)
+ ? (cjk ? diffCjk : diffNonCjk)
+ : (cjk ? wholeCjk : wholeNonCjk);
+ return new TextQualityScore((float) z, Float.NaN, Float.NaN, Float.NaN,
+ cjk ? "HAN" : "LATIN");
+ }
+
+ @Override
+ public TextQualityComparison compare(String labelA, String candidateA,
+ String labelB, String candidateB) {
+ // Not exercised by the gate path (which uses score()); provided only
+ // to satisfy the interface. Honor the contract: winner() must be
+ // labelA or labelB, picked by the higher (cleaner) z-score.
+ TextQualityScore a = score(candidateA);
+ TextQualityScore b = score(candidateB);
+ String winner = a.getZScore() >= b.getZScore() ? labelA : labelB;
+ float delta = Math.abs(a.getZScore() - b.getZScore());
+ return new TextQualityComparison(winner, delta, a, b, labelA, labelB);
+ }
+ }
+
+ /**
+ * ASCII filler + 20 copies of the byte pair {@code {0xC4, 0xE3}}: decodes to
+ * Han ideographs (你…) under GB18030 but accented Latin (Ä ã…) under
+ * windows-1252. A clean false-CJK vs real-CJK probe — the ASCII keeps the
+ * whole-text strings out of the "diff" bucket, while the high bytes are the
+ * only place the two decodes disagree.
+ */
+ private static byte[] cjkAmbiguousBytes() {
+ byte[] ascii = "the quick brown fox jumps over the lazy dog "
+ .getBytes(StandardCharsets.US_ASCII);
+ byte[] hi = new byte[40];
+ for (int i = 0; i < 20; i++) {
+ hi[2 * i] = (byte) 0xC4;
+ hi[2 * i + 1] = (byte) 0xE3;
+ }
+ byte[] out = new byte[ascii.length + hi.length];
+ System.arraycopy(ascii, 0, out, 0, ascii.length);
+ System.arraycopy(hi, 0, out, ascii.length, hi.length);
+ return out;
+ }
+
private static ParseContext contextWith(EncodingResult... results) {
EncodingDetectorContext ctx = new EncodingDetectorContext();
ctx.addResult(List.of(results), "stub");
@@ -254,4 +331,110 @@ public class JunkFilterEncodingDetectorTest {
String expected = "ത്ര";
assertEquals(expected, JunkFilterEncodingDetector.expandHtmlEntities(input));
}
+
+ // ----- CJK-vs-non-CJK family gate (the demote-only false-CJK fix) -----
+ //
+ // The whole-text z coin-flips on the CJK/non-CJK boundary for
+ // COMMON-dominated docs: markup/digits/punctuation decode identically under
+ // every candidate and swamp the few discriminating high bytes, so the junk
+ // model's whole-text argmax sometimes crowns a garbage CJK decode over the
+ // correct single-byte one (false-CJK), and sometimes the reverse. The
+ // script-letter "diff" z reads that boundary cleanly (coherent CJK vs
+ // ideograph mojibake), so the gate uses it to decide ONLY the family.
+ // Measured at 29k, the diff z reliably DEMOTES (CJK champion -> non-CJK; OOV
+ // improves on every flip) but UNreliably promotes, so the gate is
+ // demote-only and fires only past FAMILY_DIFF_MARGIN. These four tests lock
+ // each arm of that decision against the {@link ZStub}.
+
+ @Test
+ public void familyGate_demotesFalseCjkToNonCjk() throws Exception {
+ // Whole-text champion is the CJK pick (the coin-flip), but the diff z
+ // clearly prefers the non-CJK decode (coherent Latin >> ideograph
+ // mojibake, margin 7.0 > 2.0) -> gate must demote to windows-1252.
+ Charset gb = Charset.forName("GB18030");
+ Charset win1252 = Charset.forName("windows-1252");
+ ParseContext pc = contextWith(
+ new EncodingResult(gb, 0.8f, "GB18030",
+ EncodingResult.ResultType.STATISTICAL),
+ new EncodingResult(win1252, 0.7f, "windows-1252",
+ EncodingResult.ResultType.STATISTICAL));
+ // wholeCjk(-1.0) > wholeNonCjk(-1.5); diffNonCjk(-1.0) >> diffCjk(-8.0)
+ JunkFilterEncodingDetector detector =
+ new JunkFilterEncodingDetector(new ZStub(-1.0, -1.5, -8.0, -1.0));
+ try (TikaInputStream tis = TikaInputStream.get(cjkAmbiguousBytes())) {
+ List<EncodingResult> out = detector.detect(tis, new Metadata(), pc);
+ assertEquals(1, out.size());
+ assertEquals(win1252, out.get(0).getCharset(),
+ "diff z prefers non-CJK by > FAMILY_DIFF_MARGIN -> CJK "
+ + "champion must be demoted to windows-1252");
+ }
+ }
+
+ @Test
+ public void familyGate_keepsRealCjkWhenDiffAgrees() throws Exception {
+ // Whole-text champion is CJK and the diff z AGREES (ideographs coherent,
+ // Latin garbage) -> gate must NOT fire; real CJK stays CJK.
+ Charset gb = Charset.forName("GB18030");
+ Charset win1252 = Charset.forName("windows-1252");
+ ParseContext pc = contextWith(
+ new EncodingResult(gb, 0.8f, "GB18030",
+ EncodingResult.ResultType.STATISTICAL),
+ new EncodingResult(win1252, 0.7f, "windows-1252",
+ EncodingResult.ResultType.STATISTICAL));
+ // diffCjk(-1.0) >> diffNonCjk(-8.0): non-CJK does not beat CJK -> no demote
+ JunkFilterEncodingDetector detector =
+ new JunkFilterEncodingDetector(new ZStub(-1.0, -1.5, -1.0, -8.0));
+ try (TikaInputStream tis = TikaInputStream.get(cjkAmbiguousBytes())) {
+ List<EncodingResult> out = detector.detect(tis, new Metadata(), pc);
+ assertEquals(1, out.size());
+ assertEquals(gb, out.get(0).getCharset(),
+ "diff z agrees with the CJK champion -> must not demote");
+ }
+ }
+
+ @Test
+ public void familyGate_isDemoteOnly_neverPromotesNonCjkToCjk() throws Exception {
+ // Whole-text champion is NON-CJK; even though the diff z would prefer
+ // CJK, the gate is demote-only (the promote direction regressed at 29k),
+ // so the non-CJK champion must stand.
+ Charset gb = Charset.forName("GB18030");
+ Charset win1252 = Charset.forName("windows-1252");
+ ParseContext pc = contextWith(
+ new EncodingResult(gb, 0.7f, "GB18030",
+ EncodingResult.ResultType.STATISTICAL),
+ new EncodingResult(win1252, 0.8f, "windows-1252",
+ EncodingResult.ResultType.STATISTICAL));
+ // wholeNonCjk(-1.0) > wholeCjk(-1.5) -> champion non-CJK; diffCjk strong but ignored
+ JunkFilterEncodingDetector detector =
+ new JunkFilterEncodingDetector(new ZStub(-1.5, -1.0, -1.0, -8.0));
+ try (TikaInputStream tis = TikaInputStream.get(cjkAmbiguousBytes())) {
+ List<EncodingResult> out = detector.detect(tis, new Metadata(), pc);
+ assertEquals(1, out.size());
+ assertEquals(win1252, out.get(0).getCharset(),
+ "gate is demote-only: a non-CJK champion is never promoted to CJK");
+ }
+ }
+
+ @Test
+ public void familyGate_respectsDiffMargin() throws Exception {
+ // Non-CJK diff z beats CJK diff z, but by LESS than FAMILY_DIFF_MARGIN
+ // (2.0): a boundary-noise tie, not a clear signal -> no demote.
+ Charset gb = Charset.forName("GB18030");
+ Charset win1252 = Charset.forName("windows-1252");
+ ParseContext pc = contextWith(
+ new EncodingResult(gb, 0.8f, "GB18030",
+ EncodingResult.ResultType.STATISTICAL),
+ new EncodingResult(win1252, 0.7f, "windows-1252",
+ EncodingResult.ResultType.STATISTICAL));
+ // diffNonCjk(-1.0) - diffCjk(-2.0) = 1.0 < margin 2.0 -> no demote
+ JunkFilterEncodingDetector detector =
+ new JunkFilterEncodingDetector(new ZStub(-1.0, -1.5, -2.0, -1.0));
+ try (TikaInputStream tis = TikaInputStream.get(cjkAmbiguousBytes())) {
+ List<EncodingResult> out = detector.detect(tis, new Metadata(), pc);
+ assertEquals(1, out.size());
+ assertEquals(gb, out.get(0).getCharset(),
+ "diff margin below FAMILY_DIFF_MARGIN -> no demote "
+ + "(boundary-noise guard)");
+ }
+ }
}
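
The probe built by cjkAmbiguousBytes() works because the byte pair 0xC4 0xE3 is valid under both families. A minimal sketch of that ambiguity, and of the letter-only "diff" reduction the gate scores, might look as follows. The class name AmbiguousBytesSketch and the decode helper are hypothetical; scriptLetters repeats the logic shown in the detector diff. It assumes the running JDK provides the GB18030 and windows-1252 charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; class and helper names are hypothetical. */
final class AmbiguousBytesSketch {

    /** Keep only codepoints >= 0x80 that are letters or ideographs. */
    static String scriptLetters(String s) {
        StringBuilder b = new StringBuilder();
        s.codePoints().forEach(c -> {
            if (c >= 0x80 && Character.isLetter(c)) {
                b.appendCodePoint(c);
            }
        });
        return b.toString();
    }

    /** Decode with a named charset (assumed to be installed in the JDK). */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    /** Short ASCII prefix followed by one 0xC4 0xE3 pair. */
    static byte[] probe() {
        byte[] ascii = "ab ".getBytes(StandardCharsets.US_ASCII);
        byte[] out = new byte[ascii.length + 2];
        System.arraycopy(ascii, 0, out, 0, ascii.length);
        out[ascii.length] = (byte) 0xC4;
        out[ascii.length + 1] = (byte) 0xE3;
        return out;
    }

    public static void main(String[] args) {
        byte[] p = probe();
        // The ASCII prefix decodes identically; only the high bytes differ.
        System.out.println(scriptLetters(decode(p, "GB18030")));
        System.out.println(scriptLetters(decode(p, "windows-1252")));
    }
}
```

Dropping the shared ASCII is what stops the common prefix from diluting the comparison: the two reduced strings share no characters at all.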
diff --git a/tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/LatinLetterGateTest.java b/tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/LatinLetterGateTest.java
new file mode 100644
index 0000000000..441677ad65
--- /dev/null
+++ b/tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/LatinLetterGateTest.java
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.ml.junkdetect;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+import java.nio.charset.Charset;
+import java.util.LinkedHashSet;
+import java.util.Set;
+
+import org.junit.jupiter.api.Test;
+
+/**
+ * Unit tests for the within-Latin letter-plausibility gate
+ * ({@link JunkFilterEncodingDetector#applyLatinLetterGate}) in isolation.
+ */
+public class LatinLetterGateTest {
+
+ private static final Charset WIN1252 = Charset.forName("windows-1252");
+ private static final Charset IBM850 = Charset.forName("IBM850");
+ private static final Charset ISO_8859_2 = Charset.forName("ISO-8859-2");
+ private static final Charset WIN1251 = Charset.forName("windows-1251");
+ private static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");
+
+ private static Set<Charset> candidates(Charset... cs) {
+ Set<Charset> s = new LinkedHashSet<>();
+ for (Charset c : cs) {
+ s.add(c);
+ }
+ return s;
+ }
+
+ /** 0xC0-0xCF: À-Ï in windows-1252, mostly box-drawing in IBM850 (only 0xC6/0xC7
+ * are letters there) → windows-1252 wins the letter count decisively → demote. */
+ private static byte[] boxDrawingProbe(int repeats) {
+ byte[] probe = new byte[16 * repeats];
+ for (int r = 0; r < repeats; r++) {
+ for (int i = 0; i < 16; i++) {
+ probe[r * 16 + i] = (byte) (0xC0 + i);
+ }
+ }
+ return probe;
+ }
+
+ @Test
+ void demotesBoxDrawingIbm850ToWindows1252() {
+ Charset out = JunkFilterEncodingDetector.applyLatinLetterGate(
+ boxDrawingProbe(3), IBM850, candidates(IBM850, WIN1252));
+ assertEquals(WIN1252, out, "box-drawing IBM850 pick should demote to windows-1252");
+ }
+
+ /** Bytes that are Central-European letters in ISO-8859-2 (Ą Ł Ś Š Ž ...) but
+ * symbols (¡ £ ¦ © ...) in windows-1252. ISO-8859-2 wins the letter count,
+ * so the directional gate must NOT flip genuine CE text. */
+ @Test
+ void keepsGenuineCentralEuropean() {
+ int[] ceLetters = {0xA1, 0xA3, 0xA5, 0xA6, 0xA9, 0xAB, 0xAC, 0xAE, 0xAF,
+ 0xB1, 0xB3, 0xB6, 0xB9, 0xBB, 0xBC, 0xBE, 0xBF};
+ byte[] probe = new byte[ceLetters.length];
+ for (int i = 0; i < ceLetters.length; i++) {
+ probe[i] = (byte) ceLetters[i];
+ }
+ Charset out = JunkFilterEncodingDetector.applyLatinLetterGate(
+ probe, ISO_8859_2, candidates(ISO_8859_2, WIN1252));
+ assertEquals(ISO_8859_2, out, "genuine CE text wins letters under its true charset");
+ }
+
+ @Test
+ void silentBelowHighByteFloor() {
+ byte[] sparse = {(byte) 0xC0, (byte) 0xC1, (byte) 0xC2};
+ Charset out = JunkFilterEncodingDetector.applyLatinLetterGate(
+ sparse, IBM850, candidates(IBM850, WIN1252));
+ assertEquals(IBM850, out, "below the high-byte floor the gate must not act");
+ }
+
+ @Test
+ void silentOnNonLatinChampion() {
+ Charset out = JunkFilterEncodingDetector.applyLatinLetterGate(
+ boxDrawingProbe(3), WIN1251, candidates(WIN1251, WIN1252));
+ assertEquals(WIN1251, out, "Cyrillic champion is outside the Latin allowlist");
+ }
+
+ @Test
+ void silentOnCjkChampion() {
+ Charset out = JunkFilterEncodingDetector.applyLatinLetterGate(
+ boxDrawingProbe(3), SHIFT_JIS, candidates(SHIFT_JIS, WIN1252));
+ assertEquals(SHIFT_JIS, out, "CJK champion is the family gate's territory, not this one");
+ }
+
+ @Test
+ void silentWhenWindows1252NotACandidate() {
+ Charset out = JunkFilterEncodingDetector.applyLatinLetterGate(
+ boxDrawingProbe(3), IBM850, candidates(IBM850, ISO_8859_2));
+ assertEquals(IBM850, out, "nothing canonical to demote to without a windows-1252 candidate");
+ }
+}
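
The margin arithmetic these tests exercise can be sketched on its own. The constants below are copied from the detector diff, but the counting helpers are assumptions: HighByteLetterStats is not shown in this part of the commit, so "cased" is taken here to mean Character.isUpperCase or Character.isLowerCase on the decoded codepoint. The class name LatinGateMarginSketch is hypothetical.

```java
import java.nio.charset.Charset;

/** Illustrative sketch only; the helper definitions are assumptions. */
final class LatinGateMarginSketch {
    // Values copied from the LATIN_GATE_* constants in the diff.
    static final int MIN_HIGH_BYTES = 16;
    static final double MARGIN_FLOOR = 6.0;
    static final double MARGIN_FRACTION = 0.20;

    /** Number of bytes with the high bit set. */
    static int countHighBytes(byte[] probe) {
        int n = 0;
        for (byte b : probe) {
            if ((b & 0xFF) >= 0x80) {
                n++;
            }
        }
        return n;
    }

    /** Assumed definition: high bytes that decode to an upper- or lowercase letter. */
    static int countCasedHighByteLetters(byte[] probe, Charset singleByteCharset) {
        int n = 0;
        for (byte b : probe) {
            if ((b & 0xFF) < 0x80) {
                continue;
            }
            // Single-byte charsets decode one byte to exactly one char.
            int cp = new String(new byte[] {b}, singleByteCharset).codePointAt(0);
            if (Character.isUpperCase(cp) || Character.isLowerCase(cp)) {
                n++;
            }
        }
        return n;
    }

    /** True when windows-1252 wins the letter count by more than the margin. */
    static boolean shouldDemote(int winLetters, int champLetters, int highBytes) {
        if (highBytes < MIN_HIGH_BYTES) {
            return false; // too sparse: the letter gap is noise
        }
        double margin = Math.max(MARGIN_FLOOR, MARGIN_FRACTION * highBytes);
        return winLetters > champLetters + margin;
    }

    public static void main(String[] args) {
        byte[] probe = new byte[48];
        for (int i = 0; i < probe.length; i++) {
            probe[i] = (byte) (0xC0 + (i % 16));
        }
        int high = countHighBytes(probe);
        int win = countCasedHighByteLetters(probe, Charset.forName("windows-1252"));
        System.out.println("high=" + high + " winLetters=" + win);
    }
}
```

With 20 high bytes the floor applies (margin 6.0), so a lead of exactly 6 letters is not enough; with 48 high bytes the fractional term takes over (margin 9.6).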