This is an automated email from the ASF dual-hosted git repository.
tallison pushed a commit to branch TIKA-4685-chardet
in repository https://gitbox.apache.org/repos/asf/tika.git
The following commit(s) were added to refs/heads/TIKA-4685-chardet by this push:
new 391f017e98 TIKA-4685 - tweaks
391f017e98 is described below
commit 391f017e980111199497ae30b132577cf2620c7f
Author: tballison <[email protected]>
AuthorDate: Fri Mar 6 16:26:35 2026 -0500
TIKA-4685 - tweaks
---
.../pages/advanced/charset-detection-design.adoc | 282 ++++---
docs/pom.xml | 14 +
tika-ml/tika-ml-chardetect/README.md | 872 +--------------------
.../chardetect/tools/BuildCharsetTrainingData.java | 13 +
.../ml/chardetect/tools/EvalCharsetDetectors.java | 21 +-
5 files changed, 265 insertions(+), 937 deletions(-)
diff --git a/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
b/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
index 5f14041c40..94f72b1284 100644
--- a/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
+++ b/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
@@ -37,8 +37,8 @@ every detector runs regardless of what the others returned, and the
| `MojibusterEncodingDetector`
| `tika-encoding-detector-mojibuster`
| ML byte-ngram classifier + structural rules. Returns up to N STATISTICAL
- candidates ranked by logit. Internally routes EBCDIC bytes through a
- two-stage sub-model (see <<ebcdic>>).
+ candidates ranked by logit. All charsets — including UTF-16/32 and all
+ EBCDIC variants — are direct labels in a single 37-class model.
| 4
| `StandardHtmlEncodingDetector`
@@ -63,7 +63,7 @@ Each `EncodingResult` carries:
* `charset` — the detected `java.nio.charset.Charset`
* `confidence` — 0.0–1.0 float
* `label` — the detector's original internal label (e.g. `IBM424-ltr`,
- `EBCDIC`) which may be finer-grained than the Java charset name
+ `UTF-16-BE`) which may be finer-grained than the Java charset name
* `resultType` — one of:
[cols="1,3", options="header"]
@@ -76,9 +76,9 @@ Each `EncodingResult` carries:
unless structurally impossible.
| `STRUCTURAL`
-| Derived from byte-level structure (UTF-8 validity, wide-Unicode null columns,
- EBCDIC space distribution). More reliable than statistics but less
- authoritative than an explicit declaration.
+| Derived from byte-level structure (UTF-8 validity, EBCDIC space distribution).
+ More reliable than statistics but less authoritative than an explicit
+ declaration.
| `STATISTICAL`
| ML model output. Plausible but not certain; subject to arbitration by
@@ -103,7 +103,9 @@ When no bytes ≥ 0x80 are present, `MojibusterEncodingDetector` returns
== MojibusterEncodingDetector
A multinomial logistic regression classifier trained on byte n-gram features
-extracted from the MADLAD-400 corpus and Cantonese Wikipedia.
+extracted from the MADLAD-400 corpus and Cantonese Wikipedia. All 37 charset
+classes — including UTF-16/32 variants and all EBCDIC variants — are direct
+labels in a single unified model.
=== Feature extraction
@@ -116,7 +118,7 @@ ignored without stripping.
captures multi-byte character structure (Shift_JIS, EUC-*, Big5, GB18030).
* **Stride-2 bigrams** — pairs sampled at even positions `(b[2i], b[2i+2])`;
gives the model structural visibility into UTF-16/32 null-column patterns
- without requiring the `WideUnicodeDetector` pre-filter for those encodings.
+ directly within the single unified model.
Features are FNV-1a hashed into **16 384 buckets** (≈ 400 KB model file).
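As a minimal illustrative sketch (not the Tika source; class and method names here are hypothetical), feature extraction along the lines described above — only bytes ≥ 0x80 contribute, unigrams plus consecutive bigrams, FNV-1a hashed modulo the bucket count — could look like this. Stride-2 bigrams are omitted for brevity, and the leading `1` in the bigram hash is an assumed namespace byte to keep unigram and bigram feature spaces separate:

```java
// Hypothetical sketch of byte n-gram feature hashing with 32-bit FNV-1a.
public class NgramHasher {
    static final int FNV_PRIME = 0x01000193;
    static final int FNV_OFFSET = 0x811c9dc5;

    static int fnv1a(int... bytes) {
        int h = FNV_OFFSET;
        for (int b : bytes) {
            h ^= (b & 0xff);
            h *= FNV_PRIME;
        }
        return h;
    }

    /** Count unigram + bigram features for bytes >= 0x80 into a fixed bucket array. */
    static int[] extract(byte[] probe, int buckets) {
        int[] counts = new int[buckets];
        for (int i = 0; i < probe.length; i++) {
            int b = probe[i] & 0xff;
            if (b < 0x80) continue;  // ASCII bytes (incl. HTML markup) carry no features
            counts[Math.floorMod(fnv1a(b), buckets)]++;  // unigram
            if (i + 1 < probe.length) {
                // consecutive bigram; the leading 1 namespaces bigrams vs unigrams
                counts[Math.floorMod(fnv1a(1, b, probe[i + 1] & 0xff), buckets)]++;
            }
        }
        return counts;
    }
}
```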
@@ -126,15 +128,35 @@ Features are FNV-1a hashed into **16 384 buckets** (≈ 400 KB model file).
|===
| Rule | Trigger
-| `checkHz` | `~{` / `~}` escape sequences → HZ-GB-2312
| `detectIso2022` | ESC designation sequences → ISO-2022-JP / KR / CN
| `checkAscii` | No bytes ≥ 0x80 → `windows-1252` (pure ASCII default)
| `checkUtf8` | Structurally valid UTF-8 multi-byte sequences → `UTF-8`;
provably invalid sequences exclude UTF-8 from model candidates
| `checkIbm424` | EBCDIC space (0x40) dominance + Hebrew-range bytes →
IBM424-ltr/rtl
-| `checkIbm500` | EBCDIC space (0x40) + Latin letter cluster density → IBM500
|===
+NOTE: `checkHz` (HZ-GB-2312) and `checkIbm500` exist in `StructuralEncodingRules`
+but are not called in the current detection path. HZ is vanishingly rare in
+practice; IBM500 and IBM1047 are treated as a confusable pair in
+`CharsetConfusables` (they differ in only 9 of 256 byte mappings, 5 of which
+are below 0x80 and invisible to the model).
+
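To make the `checkUtf8` idea concrete, here is a minimal sketch (hypothetical names, not the Tika implementation) of the structural test it describes: scan for multi-byte sequences, return false on the first provably invalid one (which would exclude UTF-8 from the model's candidates), and treat a truncated trailing sequence at the probe edge as inconclusive rather than invalid:

```java
// Hypothetical sketch of a structural UTF-8 validity check.
public class Utf8Check {
    static boolean looksLikeUtf8(byte[] probe) {
        int i = 0;
        boolean sawMultiByte = false;
        while (i < probe.length) {
            int b = probe[i] & 0xff;
            int len;
            if (b < 0x80) { i++; continue; }            // ASCII: no evidence either way
            else if ((b & 0xe0) == 0xc0) len = 2;
            else if ((b & 0xf0) == 0xe0) len = 3;
            else if ((b & 0xf8) == 0xf0) len = 4;
            else return false;                          // stray continuation or invalid lead
            if (i + len > probe.length) break;          // truncated at probe edge: inconclusive
            for (int j = 1; j < len; j++) {
                if ((probe[i + j] & 0xc0) != 0x80) return false;  // bad continuation byte
            }
            sawMultiByte = true;
            i += len;
        }
        return sawMultiByte;  // pure ASCII is handled by checkAscii, not here
    }
}
```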
+=== Candidate selection for short probes
+
+After the model scores all 37 classes, candidate selection works in two steps:
+
+. **Logit-gap window** (`selectByLogitGap`) — include all candidates whose
+ logit is within `LOGIT_GAP` (5.0) of the top logit.
+. **Short-probe floor** — for probes shorter than 50 bytes, if the gap window
+ returns fewer than `MIN_CANDIDATES` (8) results, `selectAtLeast(8)` extends
+ the window to the top 8 candidates by raw logit rank. This ensures that
+ CharSoup has a meaningful set of alternatives to arbitrate on short inputs
+ (e.g. ZIP entry filenames) where a single dominant-but-wrong logit would
+ otherwise shut out the correct answer.
+
+BOM bytes are stripped from the probe before feature extraction so that the
+BOM itself does not bias the byte-ngram features.
+
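The two-step selection above can be sketched as follows. This is an illustrative rendering, not the Tika source; the class and method names are hypothetical, though the constants mirror the values quoted in the text (`LOGIT_GAP` = 5.0, `MIN_CANDIDATES` = 8, 50-byte short-probe threshold):

```java
import java.util.*;

// Hypothetical sketch: logit-gap window plus short-probe floor.
public class CandidateSelect {
    static final double LOGIT_GAP = 5.0;
    static final int MIN_CANDIDATES = 8;
    static final int SHORT_PROBE_BYTES = 50;

    /** logits: charset label -> raw logit. Returns labels in descending logit order. */
    static List<String> select(Map<String, Double> logits, int probeLength) {
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(logits.entrySet());
        ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        double top = ranked.get(0).getValue();
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Double> e : ranked) {
            if (top - e.getValue() <= LOGIT_GAP) out.add(e.getKey());  // gap window
        }
        if (probeLength < SHORT_PROBE_BYTES) {
            // Short-probe floor: extend to the top MIN_CANDIDATES by raw logit rank.
            for (Map.Entry<String, Double> e : ranked) {
                if (out.size() >= MIN_CANDIDATES) break;
                if (!out.contains(e.getKey())) out.add(e.getKey());
            }
        }
        return out;
    }
}
```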
=== Post-model corrections
**ISO-8859-X → Windows-12XX upgrade** +
@@ -166,64 +188,19 @@ score. Grammar confidence (a count ratio) and model confidence (sigmoid of a
logit) are different quantities on different scales; mixing them produces
apples-to-oranges comparisons that break downstream arbitration.
-[[ebcdic]]
-== Two-stage EBCDIC detection
-
-EBCDIC bytes are ambiguous to a general byte-ngram model because the 7 supported
-IBM variants share most of their byte distributions. Detection is split into
-two stages to keep the general model focused on the broad charset landscape while
-giving EBCDIC discrimination its own tuned sub-model.
-
-=== Stage 1 — General model routing
-
-The general model (25 labels) includes a single routing label `"EBCDIC"`.
-When the top logit from the general model is `"EBCDIC"`, the result is not
-returned directly. `MojibusterEncodingDetector` routes the probe to the EBCDIC
-sub-model instead.
-
-NOTE: Individual IBM variant labels (`IBM500`, `IBM420`, etc.) do **not** appear
-in the general model. They live only in the sub-model.
-
-=== Stage 2 — EBCDIC sub-model
-
-A separate smaller model (5 labels, 1 024 buckets) trained exclusively on the
-5 true EBCDIC variants:
-
-[cols="1,2", options="header"]
-|===
-| Label | Encoding
-
-| `IBM420-ltr` | EBCDIC Arabic, logical (left-to-right storage)
-| `IBM420-rtl` | EBCDIC Arabic, visual (right-to-left storage)
-| `IBM424-ltr` | EBCDIC Hebrew, logical
-| `IBM424-rtl` | EBCDIC Hebrew, visual
-| `IBM500` | International EBCDIC (Latin)
-|===
-
-NOTE: `IBM855` and `IBM866` are DOS/OEM Cyrillic code pages, not true EBCDIC.
-Their byte layouts are entirely different from EBCDIC (ASCII characters occupy
-the same positions as in other Latin encodings). The general model classifies
-them directly as Cyrillic encodings — they never trigger the `"EBCDIC"` routing
-label and therefore never reach the sub-model. They appear as direct labels in
-the general model alongside `windows-1251`, `KOI8-R`, and `x-mac-cyrillic`.
-
-The sub-model's narrower training distribution gives it much better
-discrimination between IBM variants than the general model could achieve with
-a single `"EBCDIC"` class.
-
-=== Structural pre-filters for EBCDIC
+=== EBCDIC variants as direct model labels
-Before the two-stage ML path, two structural rules catch the clearest EBCDIC
-cases immediately:
+All IBM EBCDIC variants are direct labels in the single 37-class model —
+there is no separate EBCDIC routing step or sub-model. The model handles:
-* `checkIbm424` — EBCDIC space (0x40) dominance + bytes in the Hebrew letter
- range (0x41–0x6A are all below 0x80, invisible to ML). Returns IBM424-ltr
- or IBM424-rtl without invoking either model.
-* `checkIbm500` — EBCDIC space (0x40) + Latin letter cluster density across six
- byte bands (0x81–0x89, 0x91–0x99, etc.). Returns IBM500 without invoking
- the ML path.
+* `IBM500` / `IBM1047` — Latin international EBCDIC (confusable pair, differ in 9 of 256 byte positions)
+* `IBM424-ltr` / `IBM424-rtl` — EBCDIC Hebrew (ltr/rtl are the same code page; `checkIbm424` fires first for clear cases)
+* `IBM420-ltr` / `IBM420-rtl` — EBCDIC Arabic (aspirational; training data requires the `cp420` Python codec, unavailable on some platforms)
+* `IBM850` / `IBM852` / `IBM855` / `IBM866` — DOS/OEM code pages (not true EBCDIC; byte layouts follow the ASCII/Latin convention)
-When neither structural rule fires, the two-stage ML path handles it.
+NOTE: `IBM855` and `IBM866` are DOS Cyrillic code pages, not EBCDIC. Their
+byte layouts are entirely different from EBCDIC and they are classified directly
+alongside `windows-1251`, `KOI8-R`, and `x-mac-cyrillic`.
[[charsoup]]
== CharSoupEncodingDetector — language-signal arbitration
@@ -233,46 +210,40 @@ chain switches `CompositeEncodingDetector` into collect-all mode. After all
other detectors run, CharSoup receives the full `EncodingDetectorContext` and
arbitrates.
-=== Arbitration rules (in priority order)
-
-. **DECLARATIVE wins outright** — if any detector returned a DECLARATIVE result
- (BOM, HTML `<meta>`, Content-Type header, metadata hint), it is returned
- unchanged. Statistical candidates are not considered.
+Before any charset decoding, CharSoup strips leading BOM bytes from the raw
+probe. This ensures every candidate charset decodes the same content bytes,
+preventing the BOM itself from skewing language scores.
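BOM stripping as described above is simple prefix matching; a self-contained sketch (hypothetical class name, not the Tika source) must only check the longer UTF-32 BOMs before the UTF-16 prefixes they contain:

```java
import java.util.Arrays;

// Hypothetical sketch of leading-BOM stripping before decoding.
public class BomStrip {
    // Order matters: UTF-32LE's BOM begins with UTF-16LE's BOM.
    private static final byte[][] BOMS = {
        {(byte) 0x00, (byte) 0x00, (byte) 0xFE, (byte) 0xFF},   // UTF-32BE
        {(byte) 0xFF, (byte) 0xFE, (byte) 0x00, (byte) 0x00},   // UTF-32LE
        {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF},                // UTF-8
        {(byte) 0xFE, (byte) 0xFF},                             // UTF-16BE
        {(byte) 0xFF, (byte) 0xFE},                             // UTF-16LE
    };

    static byte[] stripBom(byte[] probe) {
        for (byte[] bom : BOMS) {
            if (probe.length >= bom.length
                    && Arrays.equals(Arrays.copyOf(probe, bom.length), bom)) {
                return Arrays.copyOfRange(probe, bom.length, probe.length);
            }
        }
        return probe;  // no BOM: content bytes unchanged
    }
}
```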
-. **Single statistical candidate** — returned directly without language scoring.
+=== Arbitration rules (in priority order)
-. **Multiple statistical candidates — language scoring** — for each candidate
- charset, the candidate bytes are decoded using that charset and fed to
- `CharSoupLanguageDetector` (a character-bigram language model). The charset
- whose decoded text produces the **highest maximum logit** across all ~165
- languages wins, provided that logit is **positive** (sigmoid > 0.5).
- A positive logit means the language model actively predicts _some_ real
- language in the decoded text; a non-positive logit means the text is too
- short, too junk-heavy, or too ambiguous for any language to stand out.
+. **Unanimous** — if all detectors agree (or only one result exists), return it
+ directly without language scoring.
+
+. **Language scoring** — for each unique candidate charset, the BOM-stripped
+ bytes are decoded using that charset and fed to `CharSoupLanguageDetector`
+ (a character-bigram language model covering ~165 languages). Candidates
+  whose decoded text exceeds a junk-character threshold (`MAX_JUNK_RATIO = 0.10`)
+ are discarded before scoring. The charset whose decoded text produces the
+ **highest maximum logit** across all languages wins, provided that logit is
+ **positive** (sigmoid > 0.5).
+
+. **DECLARATIVE preference** — after language scoring, if the winner is not a
+ DECLARATIVE result but a DECLARATIVE candidate exists, the DECLARATIVE result
+ is preferred when both of the following hold:
+ * Its decoded text has junk ratio ≤ the language winner's junk ratio (it
+ decodes at least as cleanly).
+ * Its decoded text has a positive language signal (max logit > 0).
+
-Before language scoring, candidates whose decoded text exceeds a junk-character
-ratio threshold (`MAX_JUNK_RATIO = 0.10`) are discarded. Junk characters are
-U+FFFD replacement characters and ISO C1 control characters
-(excluding ordinary whitespace).
+This handles the case where a valid BOM (e.g. `UTF-16BE`) is overridden by a
+wrong-endian decoding that happens to look like CJK text, which the language
+model scores more confidently than short Latin text. The junk guard prevents
+false positives from truly lying BOMs or wrong `<meta charset>` tags.
. **Inconclusive** — if no candidate's logit is positive (all decodings are too
- ambiguous for the language model to distinguish), CharSoup returns the
- first candidate from the highest-confidence statistical detector.
-
-=== Why "positive logit" is the only threshold
-
-Earlier versions used two thresholds: a minimum absolute confidence
-(`sigmoid ≥ 0.88`) and a minimum relative margin between best and runner-up.
-Both were removed. Rationale:
-
-* If the language model gives a positive logit for any language in the decoded
- text, it has a real signal. The best-logit candidate is better than all the
- alternatives by definition — requiring an additional margin just delays the
- correct answer.
-* The margin threshold was sensitive to which other charsets happened to be in
- the candidate set. A third candidate with a strong (but wrong) language
- signal could narrow the margin below threshold and force a fallback to the
- model's top statistical pick, which might be wrong.
+ ambiguous for the language model to distinguish), CharSoup falls back to the
+ DECLARATIVE result if one exists and its decoding is at least as clean as the
+ statistical winner; otherwise it returns the first candidate from the
+ highest-confidence statistical detector.
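The positive-logit core of the language-scoring step above can be reduced to a few lines. This sketch stubs out decoding and the real language model entirely (per-charset scores are passed in directly); all names are hypothetical:

```java
import java.util.*;

// Hypothetical sketch: pick the charset whose decoding earned the highest
// language logit, but only if that logit is positive (sigmoid > 0.5).
public class Arbitrate {
    static double sigmoid(double logit) { return 1.0 / (1.0 + Math.exp(-logit)); }

    /** maxLogitPerCharset: charset -> best language-model logit for its decoding. */
    static Optional<String> pickWinner(Map<String, Double> maxLogitPerCharset) {
        String best = null;
        double bestLogit = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : maxLogitPerCharset.entrySet()) {
            if (e.getValue() > bestLogit) {
                best = e.getKey();
                bestLogit = e.getValue();
            }
        }
        // Positive logit means the model actively predicts some language;
        // otherwise the arbitration is inconclusive and the caller falls back.
        return bestLogit > 0 ? Optional.of(best) : Optional.empty();
    }
}
```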
=== junkRatio
@@ -280,6 +251,8 @@ The junk ratio filters out clearly wrong-charset decodings before language
scoring runs. Currently counts:
* U+FFFD replacement characters (wrong-charset multi-byte decode)
+* U+FFFE (the "wrong-endian BOM" / Unicode noncharacter — produced when a
+ UTF-16 BOM is decoded with the wrong byte order)
* ISO C1 control characters 0x00–0x08, 0x0E–0x1F, 0x80–0x9F (excluding TAB,
LF, VT, FF, CR which appear in source code and structured documents)
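A self-contained sketch of the junk-ratio computation matching the character classes listed above (hypothetical names, not the Tika source):

```java
// Hypothetical sketch: U+FFFD, U+FFFE, and C0/C1 controls count as junk,
// except TAB, LF, VT, FF, CR, which appear in legitimate text.
public class JunkRatio {
    static boolean isJunk(char c) {
        if (c == '\uFFFD' || c == '\uFFFE') return true;  // replacement / wrong-endian BOM
        if (c >= 0x80 && c <= 0x9F) return true;          // C1 controls
        if (c <= 0x08) return true;                       // C0 below TAB
        if (c >= 0x0E && c <= 0x1F) return true;          // C0 above CR
        return false;                                     // TAB, LF, VT, FF, CR pass
    }

    static double junkRatio(String decoded) {
        if (decoded.isEmpty()) return 0.0;
        long junk = decoded.chars().filter(c -> isJunk((char) c)).count();
        return junk / (double) decoded.length();          // ratio over UTF-16 code units
    }
}
```

A candidate would be discarded when `junkRatio(decoded) > 0.10` (`MAX_JUNK_RATIO`).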
@@ -289,6 +262,21 @@ wrong-charset single-byte decoding of multi-byte lead bytes, they also appear
legitimately in windows-125x documents (bullet lists, legal symbols). The
language model already handles this discrimination correctly.
+=== Why "positive logit" is the only threshold
+
+Earlier versions used two thresholds: a minimum absolute confidence
+(`sigmoid ≥ 0.88`) and a minimum relative margin between best and runner-up.
+Both were removed. Rationale:
+
+* If the language model gives a positive logit for any language in the decoded
+ text, it has a real signal. The best-logit candidate is better than all the
+ alternatives by definition — requiring an additional margin just delays the
+ correct answer.
+* The margin threshold was sensitive to which other charsets happened to be in
+ the candidate set. A third candidate with a strong (but wrong) language
+ signal could narrow the margin below threshold and force a fallback to the
+ model's top statistical pick, which might be wrong.
+
== MetadataCharsetDetector
A lightweight detector in `tika-core` that reads declarative charset hints from
@@ -301,8 +289,8 @@ Applies WHATWG label normalization: `ISO-8859-1` and `US-ASCII` are mapped to
`windows-1252` because browsers (and the HTML5 spec) treat them as aliases for
windows-1252 in practice.
-Returns a DECLARATIVE result, so `CharSoupEncodingDetector` will not override it
-with a statistical guess.
+Returns a DECLARATIVE result, so `CharSoupEncodingDetector` will prefer it
+over statistical candidates.
== BOMDetector
@@ -324,6 +312,68 @@ flag (default `false`) so that when `BOMDetector` is in the chain, the HTML
detector does not also attempt BOM detection — keeping BOM arbitration in one
place.
+== Performance and accuracy
+
+Numbers are from the held-out MADLAD-400 + Cantonese Wikipedia (zh_yuewiki)
+test set. 326 397 samples across 43 charsets. "All" = ML model + all
+post-processing rules — this is the production configuration (there is no
+separate `WideUnicodeDetector` pre-filter; UTF-16/32 are direct model labels).
+Full per-charset results: link:charset-detection-eval.txt[charset-detection-eval.txt]
+
+[cols="2,1,1,1", options="header"]
+|===
+| Metric | ML + rules (All) | ICU4J | juniversalchardet
+
+| Strict accuracy (full probe) | **88.7%** | 38.4% | 28.1%
+| Soft accuracy (full probe) | **93.4%** | 60.3% | 37.0%
+| Latency (full probe) | **~11 µs** | ~153 µs | ~18 µs
+| UTF-16/UTF-32 (full probe) | **~95%** | 84–100% | 0%
+| IBM424 / IBM420 / IBM500 | **Yes** | Partial | No
+| ISO-8859-3, TIS-620, KOI8-U, windows-1258 | **Yes** | No | No
+| x-mac-cyrillic, x-EUC-TW, ISO-8859-16 | **Yes** | No | No
+| Model size | **~400 KB** | ~12 MB (icu4j.jar) | ~100 KB
+|===
+
+**Strict accuracy** = exact charset name match. **Soft accuracy** = exact or
+confusable-group match (e.g. predicting IBM500 for an IBM1047 file counts as
+soft-correct since they share 247 of 256 byte mappings).
+
+NOTE: ICU4J's lower overall score reflects the expanded charset set — it has no
+support for many charsets the ML model handles (IBM850/852/855, windows-1257/1258,
+x-mac-cyrillic, x-EUC-TW, ISO-8859-16, x-MacRoman). For the charsets ICU4J
+does support (CJK, Cyrillic, Arabic, Greek, Hebrew) it is competitive.
+
+Per-probe-length summary (All column, strict / soft):
+
+----
+Probe ML+rules R% S% ICU4J R% S% juniv R% S% µs (ML)
+ 20B 35.7 36.1 20.3 39.0 22.9 30.7 7.0
+ 50B 49.9 50.2 29.8 52.4 26.5 35.2 8.1
+ 100B 82.1 86.7 33.4 56.3 27.6 36.4 7.9
+ 200B 86.5 91.3 35.8 58.6 28.0 36.9 8.9
+full 88.7 93.4 38.4 60.3 28.1 37.0 11.4
+----
+
+=== Notes on specific charsets
+
+**GB18030 low strict / high soft**: the GB family is a strict subset chain
+(GB2312 ⊂ GBK ⊂ GB18030). When ML predicts GBK for a GB18030 file, the text
+decodes correctly unless the file uses GB18030-specific 4-byte sequences (rare
+minority-language characters). The `GB_FOUR_BYTE_UPGRADE` rule catches those
+cases. ICU4J's 99% strict score for GB18030 reflects grammar-based rules; the
+practical decoding difference is small for typical Han Chinese content.
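The `GB_FOUR_BYTE_UPGRADE` check mentioned above rests on a byte-range argument that is easy to sketch (hypothetical names, not the Tika source): a GB18030 4-byte sequence has digit bytes (0x30–0x39) in positions 2 and 4, which GBK/GB2312 trail bytes (0x40–0xFE) can never produce.

```java
// Hypothetical sketch: detect a GB18030-only 4-byte sequence
// (lead 0x81-0xFE, digit, lead 0x81-0xFE, digit).
public class Gb18030Check {
    static boolean hasFourByteSequence(byte[] probe) {
        for (int i = 0; i + 3 < probe.length; i++) {
            int b1 = probe[i] & 0xff, b2 = probe[i + 1] & 0xff;
            int b3 = probe[i + 2] & 0xff, b4 = probe[i + 3] & 0xff;
            if (b1 >= 0x81 && b1 <= 0xFE && b2 >= 0x30 && b2 <= 0x39
                    && b3 >= 0x81 && b3 <= 0xFE && b4 >= 0x30 && b4 <= 0x39) {
                return true;  // definitive: a GB18030 codec is required
            }
        }
        return false;
    }
}
```

A single match upgrades a GBK/GB2312 prediction to GB18030; absence of a match leaves the prediction alone, since most GB18030 text never uses 4-byte sequences.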
+
+**IBM500 / IBM1047**: these share 247 of 256 byte mappings — the 9 positions
+that differ are mostly below 0x80 and invisible to high-byte features. For
+normal Latin prose they are genuinely indistinguishable. Both are listed in
+`CharsetConfusables` as a soft-confusable group; predicting either counts as a
+soft hit, and either charset decodes the other's content correctly for the vast
+majority of text.
+
+**IBM424 ~0% strict / ~100% soft**: the `checkIbm424` structural rule detects
+EBCDIC Hebrew correctly but does not determine directionality (ltr vs rtl), so
+the result is always the confusable-group partner — a soft hit.
+
== Configuration
=== Opt-out of the new pipeline (legacy behaviour)
@@ -359,3 +409,31 @@ scanning for `<meta charset>` tags. This can be tuned via `tika-config.json`
]
}
----
+
+== Model training and evaluation
+
+Full documentation for rebuilding the training dataset, retraining the model,
+and running the evaluation harness is in
+`tika-ml/tika-ml-chardetect/README.md` in the source tree. The short version:
+
+[source,bash]
+----
+# 1. Build training data
+JAR=tika-ml/tika-ml-chardetect/target/tika-ml-chardetect-*-tools.jar
+java -cp $JAR org.apache.tika.ml.chardetect.tools.BuildCharsetTrainingData \
+ --madlad-dir ~/datasets/madlad/data \
+ --zh-yue-file ~/datasets/zh_yuewiki/sentences_zh_yue.txt \
+ --output-dir ~/datasets/charset-detect
+
+# 2. Train
+java -cp $JAR org.apache.tika.ml.chardetect.tools.TrainCharsetModel \
+ --data ~/datasets/charset-detect/train \
+ --output ~/datasets/chardetect.bin \
+ --buckets 16384 --epochs 5
+
+# 3. Evaluate
+java -cp $JAR org.apache.tika.ml.chardetect.tools.EvalCharsetDetectors \
+ --model ~/datasets/chardetect.bin \
+ --data ~/datasets/charset-detect/test \
+ --lengths 20,50,100,200,full
+----
diff --git a/docs/pom.xml b/docs/pom.xml
index 17be3f1d96..3759e2b7c4 100644
--- a/docs/pom.xml
+++ b/docs/pom.xml
@@ -39,6 +39,20 @@ under the License.
<build>
<plugins>
+ <plugin>
+ <groupId>org.apache.rat</groupId>
+ <artifactId>apache-rat-plugin</artifactId>
+ <configuration>
+ <inputExcludes>
<!-- AsciiDoc pages and plain-text data files do not carry ASF headers -->
+ <inputExclude>modules/**/*.adoc</inputExclude>
+ <inputExclude>modules/**/*.txt</inputExclude>
+ <inputExclude>modules/**/*.yml</inputExclude>
+ <!-- UI assets: CSS, SVG, and Handlebars templates -->
+ <inputExclude>supplemental-ui/**</inputExclude>
+ </inputExcludes>
+ </configuration>
+ </plugin>
<!-- Ensure mvn clean removes Antora output -->
<plugin>
<artifactId>maven-clean-plugin</artifactId>
diff --git a/tika-ml/tika-ml-chardetect/README.md b/tika-ml/tika-ml-chardetect/README.md
index b39f8211ac..e666e2204f 100644
--- a/tika-ml/tika-ml-chardetect/README.md
+++ b/tika-ml/tika-ml-chardetect/README.md
@@ -1,538 +1,28 @@
# tika-ml-chardetect — Charset Detection
-A lightweight, production-ready charset/encoding detector for Apache Tika.
-It is designed as a drop-in replacement for the existing ICU4J-based detector
-(`Icu4jEncodingDetector`) and integrates with the standard Tika
-`EncodingDetector` interface.
+A lightweight, production-ready charset/encoding detector for Apache Tika built
+on multinomial logistic regression over byte n-gram features.
----
+## Documentation
-## Drop-in Replacement for ICU4J and juniversalchardet
+Architecture, algorithm details, accuracy numbers, and comparison with ICU4J /
+juniversalchardet are in the main Tika docs:
-`MojibusterEncodingDetector` implements the same `org.apache.tika.detect.EncodingDetector`
-interface as `Icu4jEncodingDetector` and `UniversalEncodingDetector`. There are
-three ways to use it.
+**`docs/modules/ROOT/pages/advanced/charset-detection-design.adoc`**
-### Option A — SPI auto-discovery (simplest)
+## Rebuilding the model
-Add `tika-ml-chardetect` to your classpath alongside the existing Tika jars.
-`DefaultEncodingDetector` (which backs `AutoDetectReader` and the text parsers)
-loads all registered `EncodingDetector` implementations via `ServiceLoader`.
-`MojibusterEncodingDetector` carries a `@TikaComponent` annotation that the Tika
-annotation processor uses at compile time to generate the
-`META-INF/services/org.apache.tika.detect.EncodingDetector` SPI entry, so
-it is discovered automatically.
+### Prerequisites
-**Ordering caveat**: when multiple jars each provide an
-`EncodingDetector` service entry, `DefaultEncodingDetector` uses the first
-non-`null` result across all of them. The order in which jars are consulted
-depends on classloader ordering (typically classpath order). If you need a
-guaranteed result, use Option B or C.
+* Java 17+
+* MADLAD-400 corpus (`sentences_madlad.txt` per language, one sentence per line)
+* Cantonese Wikipedia sentences for Big5/Big5-HKSCS (see adoc above)
-```xml
-<!-- Maven dependency -->
-<dependency>
- <groupId>org.apache.tika</groupId>
- <artifactId>tika-ml-chardetect</artifactId>
- <version>${tika.version}</version>
-</dependency>
-```
-
-### Option B — Explicit Java API
-
-Construct `MojibusterEncodingDetector` directly, bypassing the service registry
-entirely. Use this when you control the calling code and want a single,
-deterministic detector.
-
-```java
-import org.apache.tika.detect.EncodingDetector;
-import org.apache.tika.detect.WideUnicodeDetector;
-import org.apache.tika.io.TikaInputStream;
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.ml.chardetect.MojibusterEncodingDetector;
-import org.apache.tika.parser.ParseContext;
-
-// Full production pipeline: WideUnicodeDetector first, then ML.
-WideUnicodeDetector wide = new WideUnicodeDetector();
-MojibusterEncodingDetector ml = new MojibusterEncodingDetector(); // loads bundled model
-
-try (TikaInputStream tis = TikaInputStream.get(bytes)) {
- Metadata meta = new Metadata();
- ParseContext ctx = new ParseContext();
- List<EncodingResult> result = wide.detect(tis, meta, ctx);
- if (result.isEmpty()) {
- result = ml.detect(tis, meta, ctx);
- }
-}
-```
-
-The `WideUnicodeDetector` pre-step is optional but recommended: it handles
-UTF-16/32 via null-byte structural analysis before the ML model runs, matching
-what `EvalCharsetDetectors` calls the "Pipeline" configuration.
-
-### Option C — Tika JSON configuration (exclusive replacement)
-
-```json
-{
- "encodingDetectors": [
- { "type": "mojibuster-encoding-detector" }
- ]
-}
-```
-
-Load the config at startup:
-
-```java
-TikaConfig config = TikaConfig.load(Paths.get("tika-config.json"));
-AutoDetectParser parser = new AutoDetectParser(config);
-```
-
-### Comparison with existing detectors
-
-Numbers are from the held-out MADLAD + zh_yuewiki test set at **full probe
-length** (5 000 samples/charset, 36 charsets including structurally-detected
-ones like HZ and US-ASCII). "Pipeline" = `WideUnicodeDetector` +
-`MojibusterEncodingDetector`; see [Evaluation Results](#evaluation-results) for the
-full per-length table.
-
-| Feature | **Pipeline** | `Icu4jEncodingDetector` | `UniversalEncodingDetector` |
-|---|---|---|---|
-| Approach | Wide-unicode pre-filter + structural rules + ML | Byte-frequency statistics (ICU4J) | Mozilla-derived heuristics |
-| Overall strict accuracy (full) | **78.0%** | 51.7% | 40.7% |
-| Overall soft accuracy (full) | **91.4%** | 74.0% | 56.5% |
-| Latency (full probe) | **8.5 µs/call** | 167 µs/call | 18 µs/call |
-| UTF-16/UTF-32 | **98–100%** | 86–100% | <1% |
-| IBM424 / IBM500 / IBM855 | **Yes** | Partial | No |
-| ISO-8859-3, TIS-620, KOI8-U, windows-1258 | **Yes** | No | No |
-| x-mac-cyrillic | **Yes** | No | Partial |
-| Model size | **257 KB** (8 192 buckets) | icu4j.jar (~12 MB) | ~100 KB |
-| External dependencies | None (model bundled) | `com.ibm.icu:icu4j` | `com.github.albfernandez:juniversalchardet` |
-
----
-
-## Algorithm Overview
-
-Detection is a three-tier pipeline: wide-character pre-filter → structural
-rules → statistical model.
-
-### Tier 0 — Wide Unicode pre-filter (`WideUnicodeDetector`)
-
-Runs before anything else. Identifies UTF-16 LE/BE and UTF-32 LE/BE purely
-from structural null-byte patterns (BOM or column-based analysis). Because
-wide encodings have systematic 0x00 bytes in predictable positions, this
-detector achieves 98–100% accuracy at probe lengths as short as 20 bytes.
-If it fires, the ML model is never invoked.
-
-### Tier 1 — Structural Rules (`StructuralEncodingRules`)
-
-Fast, deterministic checks that run before the statistical model and return
-a definitive answer when possible.
-
-| Rule | Encodings detected | Rationale |
-|---|---|---|
-| `checkHz` | HZ-GB-2312 | `~{`/`~}` switching sequences are unique |
-| `detectIso2022` | ISO-2022-JP, ISO-2022-KR, ISO-2022-CN | ESC designation sequences are unique to the ISO-2022 family |
-| `checkAscii` | UTF-8 (= US-ASCII) | No bytes ≥ 0x80 → pure 7-bit ASCII, a strict UTF-8 subset |
-| `checkIbm424` | IBM424-ltr / IBM424-rtl | Hebrew letters (0x41–0x6A) and EBCDIC space (0x40) are all below 0x80, invisible to the ML model |
-| `checkIbm500` | IBM500 | EBCDIC space (0x40) dominance + high-byte Latin letter density (six clusters 0x81–0xE9) |
-| `checkUtf8` (negative only) | — | Provably invalid UTF-8 sequences exclude UTF-8 from the model's candidate set |
-
-**GB18030 4-byte upgrade** (post-model, `StructuralEncodingRules.hasGb18030FourByteSequence`):
-GB18030-specific 4-byte sequences have digit bytes (0x30–0x39) in the second
-and fourth positions, which is impossible in GBK/GB2312 (trail bytes are
-0x40–0xFE). A single matching 4-tuple is definitive proof that a GB18030
-codec is required. Applied after the model: if the model returns GBK or
-GB2312 and such a sequence is found in the probe, the result is upgraded to
-GB18030.
-
-**ISO-8859 → Windows-12XX upgrade** (post-model, `upgradeIsoToWindows`):
-C1 bytes (0x80–0x9F) are control characters in every ISO-8859-X standard but
-printable characters in every Windows-12XX encoding. Their presence is
-definitive proof the content is not ISO-8859-X; the corresponding Windows
-variant is substituted while preserving the model's confidence score.
-
-See [Windows-12XX vs ISO-8859-X](#windows-12xx-vs-iso-8859-x) below for the
-full rationale and the one exception (ISO-8859-3).
-
-### Tier 2 — Statistical Model (`MojibusterEncodingDetector`)
-
-A **multinomial logistic regression** classifier trained on byte n-gram
-features. Handles ambiguous cases that structural rules cannot resolve:
-single-byte encodings sharing the same script (KOI8-R vs windows-1251 vs
-IBM855), CJK multibyte families (GBK vs GB2312 vs GB18030), and everything
-else.
-
-**Features (`ByteNgramFeatureExtractor`)**
-
-Only bytes ≥ 0x80 contribute features. HTML tag markup (all ASCII) is
-ignored automatically; no HTML stripping is needed.
-
-- **Unigrams**: each high byte hashed individually — encodes byte-frequency
- distributions that separate SBCS encodings.
-- **Bigrams**: consecutive pair `(b[i], b[i+1])` where `b[i] ≥ 0x80` —
- captures multibyte character structure (Big5 lead/trail, GBK, Shift-JIS,
- EUC-* pairs).
-
-Features are hashed with FNV-1a into a fixed-width bucket array. The
-production model uses **8 192 buckets** (257 KB); evaluation across 4 096,
-8 192, 16 384, 32 768, and 65 536 buckets showed negligible accuracy
-differences (< 0.5 pp strict at full length), confirming this dataset is
-not bucket-limited.
-
-**Training**: multinomial logistic regression with SGD, learning rate 0.05,
-5 epochs, no L2 regularisation.
-
-**Confusable sets (`CharsetConfusables`)**: groups of encodings that are
-difficult or impossible to distinguish at the byte level (e.g. GBK ⊂ GB18030,
-Big5 ⊂ Big5-HKSCS) are defined in `CharsetConfusables`. During inference,
-probability mass is pooled across each group. During evaluation, predicting
-any group member counts as a "soft" hit.
-
-**CJK grammar walkers** (`CjkEncodingRules`, `Rule.CJK_GRAMMAR`): after the
-model nominates a CJK encoding, a grammar walker validates the byte sequences
-against the encoding's formal grammar (Shift_JIS, EUC-JP, EUC-KR, Big5,
-GB18030). A grammar score of 0 means the probe violates the encoding's byte
-grammar and the candidate is dropped; scores above 10 replace the model
-confidence with a grammar-derived confidence.
-
-See [CJK Grammar Walkers](#cjk-grammar-walkers) below for the per-encoding
-byte rules and the Shift_JIS design considerations.
-
----
-
-## Windows-12XX vs ISO-8859-X
-
-Every ISO-8859-X encoding reserves bytes `0x80–0x9F` for C1 control characters.
-These bytes never appear in real text — they are escape sequences and device
-controls that have been obsolete since the 1980s. The corresponding
-Windows-12XX encodings use that same range for printable characters: curly
-quotes, em-dash, Euro sign, ellipsis, and similar typographic symbols common
-in modern documents.
-
-This means that any byte in `0x80–0x9F` is **impossible** in ISO-8859-X but
-entirely normal in Windows-12XX. A single C1 byte is therefore definitive
-proof that the file uses the Windows variant.
-
-### Design choice: train on Windows variants, upgrade at inference
-
-Rather than training the model on both ISO and Windows variants (which would
-split probability mass between near-identical byte distributions), we:
-
-1. **Train only on Windows-12XX variants** — the model learns the Windows
- character distributions directly.
-2. **At inference, if C1 bytes are present**, apply `upgradeIsoToWindows` to
- replace any ISO-8859-X result with its Windows-12XX equivalent.
-3. **At inference, if no C1 bytes are present**, the ISO and Windows variants
- are byte-for-byte identical in the non-C1 range — the choice of label is
- arbitrary, and `CharsetConfusables` treats them as a symmetric confusable
- group so neither counts as an error in evaluation.
-
-The full ISO-to-Windows mapping (from `CharsetConfusables.ISO_TO_WINDOWS`):
-
-| ISO-8859-X | Windows equivalent | Script |
-|---|---|---|
-| ISO-8859-1 | windows-1252 | Western European (Latin-1) |
-| ISO-8859-15 | windows-1252 | Western European (Latin-9, adds € and four other chars) |
-| ISO-8859-2 | windows-1250 | Central / Eastern European |
-| ISO-8859-4 | windows-1257 | Baltic |
-| ISO-8859-5 | windows-1251 | Cyrillic |
-| ISO-8859-6 | windows-1256 | Arabic |
-| ISO-8859-7 | windows-1253 | Greek |
-| ISO-8859-8 | windows-1255 | Hebrew |
-| ISO-8859-9 | windows-1254 | Turkish (Latin-5) |
-| ISO-8859-13 | windows-1257 | Baltic (Estonian/Latvian/Lithuanian) |
-
-Note that both ISO-8859-1 and ISO-8859-15 map to windows-1252. ISO-8859-15
-differs from ISO-8859-1 only in eight code points in the `0xA0–0xFF` range
-(adding `€`, `Š`, `š`, `Ž`, `ž`, `Œ`, `œ`, `Ÿ`); its C1 range is equally
-unused. The `CharsetConfusables` symmetric group `{ISO-8859-1, ISO-8859-15,
-windows-1252}` handles this: any of the three is acceptable as a lenient match.
-
-### The ISO-8859-3 exception
-
-**ISO-8859-3** (Latin-3, used primarily for Maltese) has no Windows equivalent.
-The characters unique to Maltese — ħ, ġ, ċ, ż and their uppercase forms —
-are not representable in any Windows-12XX code page. ISO-8859-3 is therefore
-retained as a distinct training class and is never subject to the Windows
-upgrade rule.
-
-This is the only ISO-8859 variant kept in the model.
-
----
-
-## CJK Grammar Walkers
-
-The ML model reliably separates CJK content from non-CJK content but can
-confuse related CJK encodings — most critically **Shift_JIS vs EUC-JP**, where
-both encode Japanese and language-level arbitration cannot help. Grammar
-walkers in `CjkEncodingRules` provide a hard structural check: if the model
-nominates a CJK encoding, the walker either confirms it, weakly confirms it,
-or rejects it based purely on byte-sequence validity.
-
-Walkers are only invoked when the model has already placed a CJK encoding in
-its output — never unconditionally — to avoid false positives on Latin, Arabic,
-or Cyrillic content where some byte values happen to fall in CJK lead-byte
-ranges.
-
-### Confidence scoring (ICU4J-inspired)
-
-Each walker returns a score on a 0–100 scale:
-
-| Score | Meaning |
-|---|---|
-| 0 | Invalid byte sequences detected — reject this encoding |
-| 10 | Valid grammar but ≤ 10 double-byte characters — too little evidence; retain model confidence |
-| 11–100 | `30 + doubleByte − 20 × bad`, capped at 100 — replace model confidence with grammar confidence |
-
-Early exit: bail when `bad ≥ 2` and `bad × 5 ≥ doubleByte` (bad sequences
-outnumber good at more than 1:5 — same condition as ICU4J's `CharsetRecog_mbcs`).
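Put together, the score table and the early exit reduce to a few lines. An illustrative Python sketch (the production logic lives in the Java `CjkEncodingRules`; this is a restatement of the rules above, not that code):

```python
def walker_confidence(double_byte: int, bad: int) -> int:
    """Sketch of the 0-100 walker score described above.

    double_byte: count of well-formed multi-byte characters seen
    bad:         count of invalid byte sequences seen
    """
    # Early exit / reject: bad sequences outnumber good at worse than 1:5.
    if bad >= 2 and bad * 5 >= double_byte:
        return 0
    # Valid grammar but too little evidence: caller keeps model confidence.
    if double_byte <= 10:
        return 10
    # Grammar-derived confidence, capped at 100.
    return min(100, 30 + double_byte - 20 * bad)
```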
-
-### Per-encoding byte rules
-
-**Shift_JIS / windows-31j (CP932)**
-
-Shift_JIS has four byte categories:
-- `0x00–0x7F` — single-byte ASCII
-- `0xA1–0xDF` — single-byte half-width katakana
-- `0x81–0x9F` or `0xE0–0xFC` — lead byte of a double-byte character; trail
- must be `0x40–0x7F` or `0x80–0xFF` (note: `0x7F` is excluded in strict JIS
- but the grammar walker permits it since CP932 uses it)
-- `0x80`, `0xA0`, `0xFD–0xFF` — invalid
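A walker over this grammar only has to count well-formed double-byte characters and invalid sequences. A sketch that follows the byte ranges exactly as listed above, including the lenient `0x40–0xFF` trail range (not the production walker, which also tracks single-byte runs):

```python
def walk_shift_jis(data: bytes):
    """Return (double_byte, bad) counts for the Shift_JIS grammar above."""
    double_byte = bad = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F or 0xA1 <= b <= 0xDF:
            i += 1                        # ASCII or half-width katakana
        elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            # Lead byte: trail must be 0x40-0x7F or 0x80-0xFF.
            if i + 1 < len(data) and data[i + 1] >= 0x40:
                double_byte += 1
                i += 2
            else:
                bad += 1                  # truncated or invalid trail
                i += 1
        else:                             # 0x80, 0xA0, 0xFD-0xFF
            bad += 1
            i += 1
    return double_byte, bad
```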
-
-**Shift_JIS / IBM424 ambiguity**: EBCDIC space is `0x40`. In Shift_JIS, `0x40`
-is a valid trail byte (it encodes several double-byte characters when preceded
-by a lead byte in `0x81–0x9F` or `0xE0–0xFC`). The `checkIbm424` structural
-rule guards against this false positive: when counting `0x40` as EBCDIC space,
-it first checks whether the preceding byte is a Shift_JIS lead byte and, if so,
-discounts the `0x40` from the EBCDIC space count. The same guard is applied in
-`checkIbm500`.
-
-**EUC-JP**
-- `0x00–0x8D` — single byte
-- `0xA1–0xFE` — lead byte; trail must be `≥ 0xA1`
-- `0x8E` — SS2 (single-shift 2, half-width katakana); next byte `≥ 0xA1`
-- `0x8F` — SS3 (single-shift 3, JIS X 0212 supplementary); next two bytes both `≥ 0xA1`
-- `0x8E–0x9F` or `0xFF` — invalid outside SS2/SS3 context
-
-**EUC-KR**: same two-byte structure as EUC-JP without SS2/SS3 extensions.
-The shared EUC walker handles both, since any SS2/SS3 sequences that
-validate grammatically are simply counted as double-byte characters.
-
-**Big5 / Big5-HKSCS**
-- `0x00–0x7F` or `0xFF` — single byte
-- `0x81–0xFE` — lead byte; trail must be `0x40–0x7E` or `0x80–0xFE`
- (`0x7F` and `0xFF` are explicitly invalid trail bytes)
-- `0x80` — invalid
-
-The same walker covers both Big5 and Big5-HKSCS since their byte grammars are
-identical; the difference is only in which code points the double-byte pairs
-map to.
-
-**GB18030 / GBK / GB2312**
-- `0x00–0x80` — single byte
-- `0x81–0xFE` — lead byte of either:
-  - A **two-byte** sequence: trail `0x40–0x7E` or `0x80–0xFE` (GBK/GB2312 range)
- - A **four-byte** sequence: second `0x30–0x39`, third `0x81–0xFE`, fourth
- `0x30–0x39` (GB18030-only extension for rare Unicode code points)
-
-The four-byte case is the structural fingerprint used by
-`hasGb18030FourByteSequence` to upgrade GBK/GB2312 model predictions to
-GB18030 when minority-language or rare-Unicode content is present. The digit
-bytes (`0x30–0x39`) in positions 2 and 4 are impossible in GBK/GB2312 trail
-position, making a single valid 4-tuple unambiguous.
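The fingerprint check is easy to sketch as a grammar walk (illustrative Python; the real `hasGb18030FourByteSequence` is Java, and this sketch assumes the GB byte ranges stated above):

```python
def has_gb18030_four_byte_sequence(data: bytes) -> bool:
    """Walk the GB grammar and report a GB18030-only four-byte sequence:
    lead, digit, lead, digit -- impossible in GBK/GB2312 trail position."""
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x80:
            i += 1                                     # single byte
        elif 0x81 <= b <= 0xFE:
            if (i + 3 < len(data)
                    and 0x30 <= data[i + 1] <= 0x39    # digit in position 2
                    and 0x81 <= data[i + 2] <= 0xFE    # lead in position 3
                    and 0x30 <= data[i + 3] <= 0x39):  # digit in position 4
                return True
            i += 2                                     # assume two-byte pair
        else:
            i += 1                                     # 0xFF: invalid, skip
    return False
```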
-
----
-
-## Supported Encodings
-
-The pipeline supports three tiers of detection:
-
-- **Structural-only**: detected by deterministic rules before the model runs;
- if the rule fires the model is never invoked
-- **ML + structural**: in the ML model's 32-class training set, but also
- covered by a structural pre-filter that fires first on clear cases
-- **ML-only**: detected solely by the statistical model
-
-### Full encoding list
-
-| Encoding | Detection | Family | Notes |
-|---|---|---|---|
-| UTF-8 | ML + structural | Unicode | `checkAscii` → UTF-8 for pure-7-bit input; model handles high-byte UTF-8 |
-| UTF-16-LE | ML + structural | Unicode | ML class; `WideUnicodeDetector` fires first via null-byte analysis |
-| UTF-16-BE | ML + structural | Unicode | ML class; `WideUnicodeDetector` fires first via null-byte analysis |
-| UTF-32-LE | ML + structural | Unicode | ML class; `WideUnicodeDetector` fires first via null-byte analysis |
-| UTF-32-BE | ML + structural | Unicode | ML class; `WideUnicodeDetector` fires first via null-byte analysis |
-| US-ASCII | Structural only | Unicode | `checkAscii` → UTF-8 (ASCII is a strict UTF-8 subset); model never runs |
-| Shift_JIS | ML-only | Japanese | CJK grammar walker validates lead/trail byte structure post-model |
-| EUC-JP | ML-only | Japanese | CJK grammar walker with SS2/SS3 extension handling |
-| ISO-2022-JP | Structural only | Japanese | `detectIso2022` ESC sequence `ESC $ B` or `ESC $ @` |
-| EUC-KR | ML-only | Korean | CJK grammar walker (shares EUC structure with EUC-JP) |
-| ISO-2022-KR | Structural only | Korean | `detectIso2022` ESC sequence `ESC $ ) C` |
-| GB18030 | ML-only | Chinese (Simplified) | CJK grammar walker; 4-byte sequences upgrade GBK/GB2312 predictions |
-| GBK | ML-only | Chinese (Simplified) | CJK grammar walker |
-| GB2312 | ML-only | Chinese (Simplified) | CJK grammar walker |
-| Big5 | ML-only | Chinese (Traditional) | CJK grammar walker; sourced from Cantonese Wikipedia |
-| Big5-HKSCS | ML-only | Chinese (Traditional) | CJK grammar walker; superset of Big5 |
-| HZ-GB-2312 | Structural only | Chinese (Simplified) | `checkHz` — `~{`/`~}` switching markers; 7-bit encoding, no high bytes |
-| ISO-2022-CN | Structural only | Chinese (Simplified) | `detectIso2022` ESC sequence `ESC $ ) A` or `ESC $ ) G` |
-| windows-1252 | ML-only | Western European | Covers ISO-8859-1 and ISO-8859-15 via C1-byte upgrade rule |
-| windows-1250 | ML-only | Central/Eastern European | Covers ISO-8859-2 via C1-byte upgrade rule |
-| windows-1251 | ML-only | Cyrillic | Covers ISO-8859-5 via C1-byte upgrade rule |
-| windows-1253 | ML-only | Greek | Covers ISO-8859-7 via C1-byte upgrade rule |
-| windows-1254 | ML-only | Turkish | Covers ISO-8859-9 via C1-byte upgrade rule |
-| windows-1255 | ML-only | Hebrew | Covers ISO-8859-8 via C1-byte upgrade rule |
-| windows-1256 | ML-only | Arabic | Covers ISO-8859-6 via C1-byte upgrade rule |
-| windows-1257 | ML-only | Baltic | Covers ISO-8859-4 and ISO-8859-13 via C1-byte upgrade rule |
-| windows-1258 | ML-only | Vietnamese | No ISO-8859 equivalent; uses combining diacritics (NFD-style) |
-| ISO-8859-3 | ML-only | Maltese | Only ISO-8859-X without a Windows equivalent (ħ, ġ, ċ, ż) |
-| KOI8-R | ML-only | Cyrillic | Russian; soft-confusable with KOI8-U |
-| KOI8-U | ML-only | Cyrillic | Ukrainian; soft-confusable with KOI8-R |
-| IBM855 | ML-only | Cyrillic | Older DOS Cyrillic (OEM code page 855) |
-| IBM866 | ML-only | Cyrillic | DOS Cyrillic |
-| x-mac-cyrillic | ML-only | Cyrillic | Mac OS Cyrillic |
-| TIS-620 | ML-only | Thai | Also known as CP874 |
-| IBM500 | ML + structural | EBCDIC | `checkIbm500` fires first (EBCDIC space + Latin letter density); model is fallback |
-| IBM424-ltr | ML + structural | EBCDIC Hebrew | `checkIbm424` fires first; model resolves ltr/rtl when rule is insufficient |
-| IBM424-rtl | ML + structural | EBCDIC Hebrew | Same code page as IBM424-ltr, differs only in text-reversal convention |
-| IBM420-ltr | Aspirational | EBCDIC Arabic | In `LANG_CHARSETS`; skipped when Python `cp420` codec is unavailable |
-| IBM420-rtl | Aspirational | EBCDIC Arabic | Same; no structural rule implemented yet |
-
-### Counts
-
-| Category | Count |
-|---|---|
-| ML model classes | 32 |
-| Structural-only (no ML) | 6 (US-ASCII, UTF-16/32 ×4 via Wide, ISO-2022-JP/KR/CN, HZ) |
-| ML + structural pre-filter | 5 (UTF-16/32 ×4, IBM500, IBM424-ltr/rtl) |
-| Aspirational (codec-dependent) | 2 (IBM420-ltr/rtl) |
-| **Total encodings handled** | **43** |
-
-> **Note on IBM420**: Arabic EBCDIC is included in the language-to-charset
-> mapping but training is skipped at runtime when the `cp420` Python codec is
-> unavailable (common on macOS and Python 3.14+). No structural rule for
-> IBM420 exists yet. Contributions welcome.
-
----
-
-## Data Sources
-
-Training, devtest, and test splits are generated from two sources:
-
-### 1. MADLAD-400
-
-[MADLAD-400](https://arxiv.org/abs/2309.04662) is a multilingual document-level
-dataset covering 400+ languages, released under Creative Commons. For this
-project the per-language `sentences_madlad.txt` files are used (one sentence
-per line, no tab prefix). It provides the primary training signal for all
-charsets except Traditional Chinese.
-
-Download with the bundled helper:
-
-```bash
-python download_madlad.py ~/datasets/madlad/data
-```
-
-### 2. Cantonese Wikipedia (zh_yuewiki)
-
-Big5 and Big5-HKSCS require Traditional Chinese text. MADLAD's Traditional
-Chinese coverage is limited to Simplified-leaning sources; Cantonese Wikipedia
-(`zh_yuewiki`) provides a corpus of ~940 000 clean Cantonese Traditional
-Chinese sentences extracted from the MediaWiki XML dump.
-
-**Why zh_yuewiki:**
-- Native Traditional Chinese script (not OpenCC-converted)
-- Large enough for 20 000 train + 2 000 devtest + 5 000 test samples each for
- Big5 and Big5-HKSCS without repetition
-- 100% of sentences round-trip through both Big5 and Big5-HKSCS codecs without
- loss (typical Cantonese text stays within the common Big5 code range)
-
-**Extraction**: The XML dump cannot be processed with `wikiextractor` on
-Python 3.14+ due to regex compatibility issues. A custom script
-`extract_wiki_sentences.py` (in `src/test/python/`) parses the bzip2-compressed
-XML directly, strips wikitext markup with lightweight regexes, and writes one
-sentence per line.
-
-```bash
-# Download the Cantonese Wikipedia dump (~123 MB compressed)
-wget https://dumps.wikimedia.org/zh_yuewiki/latest/zh_yuewiki-latest-pages-articles.xml.bz2
-
-# Extract sentences (~940K sentences, ~130 MB plain text)
-python extract_wiki_sentences.py zh_yuewiki-latest-pages-articles.xml.bz2 \
- > ~/datasets/zh_yuewiki/sentences_zh_yue.txt
-```
-
-### Encoding coverage
-
-Not all source languages can be encoded into all charsets. The following
-charsets were **excluded** because they are superseded by a Windows equivalent
-and the ISO/Windows distinction is better resolved by the C1-byte structural
-rule at inference time:
-
-> ISO-8859-1, ISO-8859-2, ISO-8859-4, ISO-8859-5, ISO-8859-6,
-> ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-13, ISO-8859-15
-
-**ISO-8859-3 is kept** (Maltese — ħ, ġ, ċ, ż are not representable in any
-Windows charset).
-
-**EUC-TW is omitted**: Python has no `euc_tw` codec (JDK-only extension)
-and the encoding is vanishingly rare in practice, superseded by Big5 and UTF-8.
-
----
-
-## Generating the Dataset
-
-Data generation is handled entirely by the Java tool
-`BuildCharsetTrainingData` (in `src/main/java/.../tools/`). It replaces the
-former Python `build_charset_training.py` script, which was removed because
-Java supports charsets unavailable in CPython's standard codec library
-(IBM1047, x-EUC-TW, IBM420, IBM424) and because eliminating the
-Python / `ebcdic` / fastText dependency chain simplifies the build.
-
-### Step 1 — Build training, devtest, and test splits
-
-`BuildCharsetTrainingData` loads sentences from MADLAD (and zh_yuewiki for
-the `yue` virtual language), applies quality gates, and writes per-charset
-gzipped binary files in each split directory.
-
-**Encoding quality gates (both must pass):**
-
-1. **Drop-count gate**: text is encoded with `IGNORE` (unencodable characters
- are silently dropped) and decoded back with `REPLACE` (corrupt byte
- sequences become U+FFFD). A sentence is rejected if
- `(dropped characters + U+FFFD count) > 3`. This allows one or two
- typographic characters (curly quotes, em-dash) without discarding an
- otherwise useful sentence, while still catching sentences where substantial
- content is lost.
-
-2. **High-byte ratio gate**: the encoded chunk must have enough bytes ≥ 0x80
- to carry encoding-specific signal. Thresholds by family:
-
-   | Encoding family | Min high-byte ratio | Rationale |
-   |---|---|---|
-   | CJK multibyte (Big5, EUC-*, GB18030, Shift_JIS, x-EUC-TW) | ≥ 20% | Each CJK character = 2 high bytes; a sparse chunk indicates wrong source language or heavily ASCII content |
-   | UTF-8 | ≥ 5% | Ensures enough lead/continuation byte pairs for the model to learn multi-byte structure |
-   | SBCS / EBCDIC | ≥ 2% | Even 1 accented character per 50 bytes is informative; rejects pure-ASCII sentences that look identical across all SBCS charsets |
-   | US-ASCII, ISO-2022-*, UTF-16/32 | exempt | No high bytes by design; these are detected structurally, not by the ML model |
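Both gates fit in a few lines. An illustrative Python sketch (the tool itself is Java; the function names here are assumptions, and gate 1 assumes one decoded character per surviving input character, which holds for the charsets used here):

```python
def passes_drop_count_gate(text: str, charset: str, max_loss: int = 3) -> bool:
    """Gate 1: encode with IGNORE (unencodable characters dropped), decode
    back with REPLACE (corrupt sequences become U+FFFD), then reject when
    more than max_loss characters were dropped or mangled."""
    decoded = text.encode(charset, errors="ignore").decode(charset, errors="replace")
    dropped = len(text) - len(decoded)
    return dropped + decoded.count("\ufffd") <= max_loss

def passes_high_byte_gate(encoded: bytes, min_ratio: float) -> bool:
    """Gate 2: require enough bytes >= 0x80 to carry encoding-specific
    signal; min_ratio comes from the per-family table above."""
    if not encoded:
        return False
    return sum(b >= 0x80 for b in encoded) / len(encoded) >= min_ratio
```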
-
-**Split sizes:**
-
-The tool uses different sample caps for CJK/Unicode charsets and for
-SBCS/EBCDIC charsets. See [Training Data Size Rationale](#training-data-size-rationale)
-below for the full derivation.
-
-| Charset family | train | devtest | test |
-|---|---|---|---|
-| CJK multibyte (Shift_JIS, EUC-JP, EUC-KR, GB18030, Big5-HKSCS, x-EUC-TW) | 20 000 | 2 000 | 5 000 |
-| Unicode (UTF-8, UTF-16-LE/BE, UTF-32-LE/BE) | 20 000 | 2 000 | 5 000 |
-| Structural-only (US-ASCII, ISO-2022-*) | — (no train) | 2 000 | 5 000 |
-| SBCS and EBCDIC (all others) | **50 000** | 2 000 | **10 000** |
-
-Build the fat JAR once, then run:
+### Step 1 — Build training data
```bash
-# Build the tools JAR
mvn package -pl tika-ml/tika-ml-chardetect -am -Ptrain -DskipTests \
- -Dmaven.repo.local=.local_m2_repo -Dcheckstyle.skip=true -q
+ -Dcheckstyle.skip=true -q
JAR=tika-ml/tika-ml-chardetect/target/tika-ml-chardetect-*-tools.jar
@@ -540,333 +30,47 @@ java -cp $JAR \
org.apache.tika.ml.chardetect.tools.BuildCharsetTrainingData \
--madlad-dir ~/datasets/madlad/data \
--zh-yue-file ~/datasets/zh_yuewiki/sentences_zh_yue.txt \
- --output-dir ~/datasets/madlad/charset-detect3
+ --output-dir ~/datasets/charset-detect
```
-The tool prints a summary of samples written per charset per split and writes
-a `manifest.json` describing the full run.
-
### Step 2 — Train
-The fat JAR built in Step 1 contains `TrainCharsetModel` as well:
-
```bash
-JAR=tika-ml/tika-ml-chardetect/target/tika-ml-chardetect-*-tools.jar
-
java -cp $JAR \
- org.apache.tika.ml.chardetect.tools.TrainCharsetModel \
- --data ~/datasets/madlad/charset-detect3/train \
- --output ~/datasets/madlad/chardetect.bin \
- --buckets 16384 \
- --epochs 3
-```
-
-The trainer prints per-epoch loss and in-sample accuracy, then a per-charset
-breakdown on the training data. Copy the model to its classpath resource
-location to make it the active bundled model:
+ org.apache.tika.ml.chardetect.tools.TrainCharsetModel \
+ --data ~/datasets/charset-detect/train \
+ --output ~/datasets/chardetect.bin \
+ --buckets 16384 \
+ --epochs 5
-```bash
-cp ~/datasets/madlad/chardetect.bin \
- src/main/resources/org/apache/tika/ml/chardetect/chardetect.bin
+# Install as the bundled model
+cp ~/datasets/chardetect.bin \
+    tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/org/apache/tika/ml/chardetect/chardetect.bin
```
### Step 3 — Evaluate
```bash
java -cp $JAR \
- org.apache.tika.ml.chardetect.tools.EvalCharsetDetectors \
- --model ~/datasets/madlad/chardetect.bin \
- --data ~/datasets/madlad/charset-detect3/test \
- --lengths 20,50,100,200,full \
- --confusion
+ org.apache.tika.ml.chardetect.tools.EvalCharsetDetectors \
+ --model ~/datasets/chardetect.bin \
+ --data ~/datasets/charset-detect/test \
+ --lengths 20,50,100,200,full \
+ --confusion
```
-The tool reports seven columns per charset:
+## Data sources
-| Column | Description |
+| Source | Usage |
|---|---|
-| `Stat` | ML model only, no post-processing rules |
-| `+ISO` | + C1-byte → Windows-12XX upgrade |
-| `+CJK` | + CJK grammar walkers |
-| `All` | All ML rules enabled (no Wide pre-filter) |
-| `Pipeline` | `WideUnicodeDetector` + all ML rules (production configuration) |
-| `ICU4J` | `Icu4jEncodingDetector` baseline |
-| `juniv` | `UniversalEncodingDetector` baseline |
-
-Each column shows **R%** (strict — exact charset match) and **S%** (soft —
-exact or confusable-group match). A timing row (`µs/sample`) is printed below
-each probe-length section.
-
----
-
-## Training Data Size Rationale
-
-This section documents why the train/devtest/test sample counts were chosen,
-including the empirical experiments that justified the SBCS/EBCDIC uplift to
-50 000 training samples.
-
-### Model parameter count
-
-The production model is a multinomial logistic regression with
-`numBuckets × numClasses` weights. At 16 384 buckets and 38 trained classes:
-
-```
-16 384 × 38 = 622 592 weight parameters + 38 biases ≈ 623 K parameters
-```
-
-At 20 000 samples/class × 38 classes = 760 K total training samples, the
-overall samples-to-parameters ratio is about **1.2 : 1** — tight by dense
-neural-network standards but appropriate for a sparse linear classifier,
-where each sample only activates ≈ 50–150 of the 16 384 buckets and leaves
-the rest unchanged.
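The sparse-activation point can be made concrete: scoring a sample touches only the buckets its byte n-grams hash to. A schematic sketch (the feature hashing and weight layout here are assumptions for illustration, not the production model format):

```python
NUM_BUCKETS = 16384  # production bucket count

def score_sample(active_buckets, weights, biases):
    """Score every class of a hashed-feature multinomial logistic model.

    weights[c][b] is the weight of bucket b for class c.  A typical
    sample activates only ~50-150 of the 16 384 buckets, so the sum is
    sparse even though the full matrix holds buckets x classes
    parameters (16 384 x 38 = 622 592 here)."""
    scores = list(biases)
    for b in active_buckets:
        for c in range(len(scores)):
            scores[c] += weights[c][b]
    return scores
```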
-
-### Why CJK and Unicode are capped at 20 000
-
-CJK charsets have overwhelming signal per sample:
-
-- Every CJK character encodes to 2–4 bytes that are all ≥ 0x80. A 256-byte
- Shift_JIS chunk contains ≈ 100 lead/trail bigrams — far more distinctive
- features than any SBCS chunk of the same length.
-- The byte grammars of Shift_JIS, EUC-JP, EUC-KR, GB18030, and Big5 are
- disjoint in their lead-byte ranges, giving the model near-perfect
- separability even at short probe lengths.
-- In-sample accuracy for all CJK and Unicode charsets is consistently
- ≥ 99.7% from the first epoch. More data cannot improve what is already
- converged.
-
-UTF-16/32 are equally clear-cut: stride-2 bigrams at even positions expose the
-code-unit structure directly (BMP UTF-16LE produces dense `(XX, 0x00)` pairs;
-UTF-32 produces `(0x00, 0x00)` pairs every other stride-2 position).
-
-### Why SBCS/EBCDIC use 50 000
-
-Single-byte charsets differ in how they assign the 128 positions `0x80–0xFF`.
-The challenge is that adjacent or related charsets share most of that range and
-differ only in a handful of positions. Two pairs expose this cleanly:
-
-**IBM500 vs IBM1047 — 9 differing byte positions**
-
-IBM1047 (EBCDIC Open Systems Latin-1, used on z/OS Unix) and IBM500 (classic
-international EBCDIC) share 247 of 256 byte mappings. The 9 that differ:
-
-| Byte | IBM500 | IBM1047 |
-|------|--------|---------|
-| 0x25 | LF (U+000A) | NEL (U+0085) — the z/OS Unix line terminator |
-| 0x4F | `!` | `\|` |
-| 0x5A | `]` | `!` |
-| 0x4A | `[` | `¢` |
-| 0xAD | `Ý` | `[` |
-| 0xB0 | `¢` | `¬` |
-| 0xBA | `¬` | `Ý` |
-| 0xBB | `\|` | `¨` |
-| 0xBD | `¨` | `]` |
-
-Five of the nine differing byte positions are below 0x80 and therefore
-invisible to stride-1 (high-byte-anchored) features. They are only visible
-through stride-2 bigrams at even byte positions. The exclamation mark (`!`)
-is the most frequent distinguishing character in prose text: it encodes to
-byte 0x4F in IBM500 and 0x5A in IBM1047. In typical Wikipedia prose, `!`
-appears roughly 0.3–0.5% of the time — about **1 occurrence per 256-byte
-sample**, with a 50% chance of landing at an even stride-2 position.
-
-At 20 000 samples this gives the model approximately 20 000 × 0.5 = 10 000
-observations of the key distinguishing bigram — statistically sufficient in
-theory, but in practice the signal-to-noise ratio is low because the same
-stride-2 bucket receives contributions from many other byte pairs that happen
-to hash to it. Empirically, training at 20 000 samples/class and 16 384
-buckets produced:
-
-```
-IBM500 in-sample accuracy: 68.5%
-IBM1047 in-sample accuracy: 52.0%
-```
-
-Increasing to 32 768 buckets (reducing hash collision noise) changed the
-balance dramatically:
-
-```
-IBM500 in-sample accuracy: 28.5% ← model now over-predicts IBM1047
-IBM1047 in-sample accuracy: 91.1%
-```
-
-This reveals that the 16 384-bucket model is not bucket-limited — it is
-**data-limited for this specific pair**. More buckets give the model enough
-capacity to memorise the IBM1047 signal and then swing hard in that direction,
-at the cost of IBM500 accuracy. The solution is more training samples (so the
-frequency estimates become more reliable) rather than more buckets. Bumping
-to **50 000 samples** at 16 384 buckets brings IBM1047 to ~91% without
-collapsing IBM500 (see the bucket-size note in the next section).
-
-**IBM437 vs IBM850 — superset/subset pair**
-
-IBM850 is a superset of IBM437. The characters that differ are in the
-`0xB0–0xFF` range, where IBM437 uses box-drawing characters and IBM850 uses
-Latin extended characters. Box-drawing characters never appear in prose text,
-so the quality gate (high-byte ≥ 2%) combined with the IGNORE encoder means
-that the training samples for IBM437 and IBM850 contain nearly identical
-high-byte distributions for most sentences. This is a fundamental signal
-limitation, not a sample-count limitation (see [Known Limitations](#known-limitations)).
-
-### Why devtest is 2 000
-
-Devtest is used for **model selection** — choosing between bucket counts,
-feature flag combinations, and learning-rate schedules during the simulated
-annealing search in `anneal.py`. The expected accuracy differences between
-candidate configurations are in the range 0.5–3 percentage points. At 2 000
-samples per charset (76 000 total for 38 trained classes), the 95% confidence
-interval on a binomial proportion at 90% accuracy is ±1.4 pp — sufficient to
-distinguish configurations that differ by more than noise.
-
-More than 2 000 devtest samples per charset would lengthen each annealing
-trial without improving selection quality, since the variance across annealing
-trials (caused by SGD randomness) dominates over the devtest estimation error.
-
-### Why test is 5 000 for CJK/Unicode and 10 000 for SBCS/EBCDIC
-
-The held-out test set is evaluated **once** after all hyperparameter choices
-are locked in. The precision required differs by charset family:
-
-- **CJK/Unicode**: accuracy is consistently above 95%. At 5 000 samples the
- 95% CI on a 99% accuracy measurement is ±0.3 pp — more than adequate.
-
-- **SBCS/EBCDIC**: the hard confusable pairs (IBM500/IBM1047, IBM850/IBM437,
- windows-1250/windows-1252) produce accuracy in the 50–95% range. At 5 000
- samples the CI at 90% accuracy is ±0.8 pp. Doubling to **10 000 samples**
- tightens the CI to ±0.6 pp and, more importantly, gives enough samples to
- compute meaningful **per-pair confusion matrices** (e.g., how often IBM1047
- is predicted for an IBM500 file) — which are the primary diagnostic for
- deciding whether confusable pairs should be added to `CharsetConfusables`.
-
-### Known limitations
-
-**IBM500 vs IBM1047** and **IBM437 vs IBM850** remain genuinely difficult to
-distinguish from Wikipedia prose text alone:
-
-- The distinguishing bytes are either low (< 0x80, invisible to stride-1
- features) or map to characters that rarely appear in prose.
-- The quality gate that rejects unencodable sentences inadvertently filters
- out the most discriminating sentences for the IBM437/IBM850 pair.
-
-Planned mitigations:
-
-1. Add both pairs to `CharsetConfusables` so that predicting either member
- of a pair counts as a soft hit in evaluation and downstream decoding
- remains correct.
-2. Supplement training data with technical/code-like text (shell scripts, C
- source) where `!`, `[`, `]`, `|` appear at higher frequency, strengthening
- the stride-2 bigram signal for IBM500/IBM1047.
-3. Consider merging IBM437 into IBM850 as a single "DOS Western European"
- class, since IBM850 is the superset and correctly decodes all IBM437
content.
-
-### Bucket size sweep
-
-Earlier evaluation across bucket sizes on the 34-class corpus showed negligible
-accuracy differences, confirming the feature space is not the primary bottleneck:
-
-| Buckets | Model size | Strict (full) | Soft (full) |
-|---|---|---|---|
-| 65 536 | 2.0 MB | 70.6% | 85.6% |
-| 32 768 | 1.0 MB | 70.9% | 85.6% |
-| 16 384 | 512 KB | 71.2% | 85.6% |
-| 8 192 | 257 KB | 70.9% | 85.6% |
-| 4 096 | 129 KB | 70.3% | 85.5% |
-
-(ML-All column, 34-class held-out test set at full probe length.)
-
-The **current production model** uses **16 384 buckets** (38 classes, ~1.02 MB).
-See [Training Data Size Rationale](#training-data-size-rationale) for why 32 768
-buckets is not recommended despite the larger corpus — for the hardest confusable
-pairs (IBM500/IBM1047) more buckets cause the model to overfit one label at the
-expense of the other rather than improving overall accuracy.
-
----
-
-## Evaluation Results
-
-Held-out test set: **5 000 samples/charset**, 36 charsets, generated from
-MADLAD-400 + Cantonese Wikipedia (zh_yuewiki). Big5/Big5-HKSCS samples come
-exclusively from zh_yuewiki; all other charsets from MADLAD. All splits are
-genuine held-out — no training sentences appear in devtest or test.
-
-**R%** = strict accuracy (exact charset name match).
-**S%** = soft accuracy (exact or confusable-group match).
-
-```
-=== Probe length: 20B ===
-Charset      N | Stat R% S% | +ISO R% S% | +CJK R% S% | All R% S% | Pipe R% S% | ICU4J R% S% | juniv R% S% |
--------------------------------------------------------------------------------------------------------------
-Big5-HKSCS 5000 | 77.0 99.7 | 74.7 96.4 | 74.7 96.4 | 74.7 96.4 | 74.7 96.4 | 0.0 97.6 | 0.0 71.2 |
-Big5       5000 | 67.9 99.3 | 64.9 95.5 | 64.9 95.5 | 64.9 95.5 | 64.9 95.5 | 97.3 97.3 | 98.0 98.0 |
-EUC-JP     5000 | 94.5 94.5 | 94.5 94.5 | 94.5 94.5 | 94.5 94.5 | 94.5 94.5 | 83.8 83.8 | 91.2 91.2 |
-EUC-KR     5000 | 91.6 91.6 | 91.6 91.6 | 91.6 91.6 | 91.6 91.6 | 91.6 91.6 | 92.0 92.0 | 91.7 91.7 |
-GB18030    5000 | 11.6 99.8 | 11.6 99.8 | 11.6 99.8 | 17.0 99.8 | 17.0 99.8 | 99.4 99.4 | 99.9 99.9 |
-GBK        5000 | 76.7 99.9 | 76.7 99.9 | 76.7 99.9 | 76.7 99.9 | 76.7 99.9 | 0.0 99.4 | 0.0 99.8 |
-...
-UTF-16-BE  5000 |  0.0  0.0 |  0.0  0.0 |  0.0  0.0 |  0.0  0.0 | 97.8 97.8 | 69.9 69.9 | 0.4 0.4 |
-UTF-16-LE  5000 |  0.0  0.0 |  0.0  0.0 |  0.0  0.0 |  0.0  0.0 | 98.1 98.1 | 69.5 69.5 | 0.5 0.5 |
-UTF-32-BE  5000 |  0.0  0.0 |  0.0  0.0 |  0.0  0.0 |  0.0  0.0 | 99.8 99.8 | 100.0 100.0 | 0.6 0.6 |
-UTF-32-LE  5000 |  0.0  0.0 |  0.0  0.0 |  0.0  0.0 |  0.0  0.0 | 99.8 99.8 | 100.0 100.0 | 0.6 0.6 |
--------------------------------------------------------------------------------------------------------------
-OVERALL  179470 | 41.2 48.5 | 48.1 61.8 | 48.1 61.8 | 48.1 61.8 | 50.0 61.2 | 25.8 38.6 | 32.2 43.8 |
-  µs/sample    | 11.1 | 9.7 | 9.7 | 9.6 | 7.2 | 11.4 | 4.0 |
-
-=== Probe length: 100B ===
-OVERALL  179470 | 62.0 73.6 | 68.0 82.8 | 68.0 82.8 | 68.0 82.8 | 72.4 85.1 | 46.8 69.9 | 40.1 55.7 |
-  µs/sample    | 8.3 | 7.8 | 7.5 | 7.4 | 5.5 | 36.1 | 6.1 |
-
-=== Probe length: full ===
-OVERALL  179470 | 65.3 77.4 | 70.7 85.6 | 70.7 85.6 | 70.9 85.6 | 78.0 91.4 | 51.7 74.0 | 40.7 56.5 |
-  µs/sample    | 12.5 | 12.6 | 10.8 | 10.8 | 8.5 | 167.1 | 18.3 |
-```
-
-### Notes on specific charsets
-
-**UTF-16/32 — 0% strict in `All`, 98–100% in `Pipeline`**: the ML model has
-no structural mechanism to distinguish wide encodings; `WideUnicodeDetector`
-handles them before ML is invoked. The `Pipeline` column is the correct metric
-for production use.
-
-**Pipeline is fastest**: `WideUnicodeDetector` short-circuits ~11% of samples
-(the UTF-16/32 ones) before the ML model runs, reducing average latency below
-any single-detector configuration.
-
-**ICU4J at 167 µs/full vs Pipeline at 8.5 µs/full**: ICU4J's statistical
-tables require a full-probe pass with complex byte-frequency analysis; the ML
-model uses sparse hash lookups over high bytes only.
-
-**GB18030 low strict / high soft**: the GB family is a strict subset chain
-(GB2312 ⊂ GBK ⊂ GB18030). When ML predicts GBK for a GB18030 file, the text
-decodes correctly unless the file uses GB18030-specific 4-byte sequences (rare
-minority-language characters). The `GB_FOUR_BYTE_UPGRADE` rule catches those
-cases. ICU4J's 99% strict score for GB18030 reflects grammar-based rules; the
-practical decoding difference is small for typical Han Chinese content.
-
-**Big5 vs Big5-HKSCS**: both charsets encode Cantonese Wikipedia identically
-at the byte level (the HKSCS extension characters are rare in typical text).
-The model learns to distinguish them from subtle frequency differences. When
-it predicts Big5 for a Big5-HKSCS file, the only risk is HKSCS extension
-characters being mis-decoded; predicting Big5-HKSCS for a Big5 file is always
-safe (superset decodes the subset correctly).
-
-**IBM424 0% strict / ~100% soft**: the structural rule detects EBCDIC Hebrew
-correctly but does not determine directionality (ltr vs rtl), so the result is
-always the confusable-group partner — a soft hit.
-
----
-
-## Source Language → Charset Mapping
+| [MADLAD-400](https://arxiv.org/abs/2309.04662) | Primary training corpus (400+ languages, CC licensed) |
+| Cantonese Wikipedia (`zh_yuewiki`) | Big5 / Big5-HKSCS training data (Traditional Chinese) |
-The pipeline maps ISO 639-3 language codes to applicable legacy charsets.
-Key decisions:
+## Supported charsets
-- **Big5 / Big5-HKSCS**: sourced from Cantonese Wikipedia (`yue` virtual
- language). MADLAD Simplified Chinese is excluded because simplified
- characters are generally not encodable in plain Big5.
-- **IBM424**: Hebrew (`heb`) only, both LTR (logical) and RTL (visual/reversed)
- variants use the same IBM424 code page.
-- **IBM500**: all languages that can encode into Latin-1 (English, French,
- German, Spanish, Dutch, etc.) — IBM500 is a Latin EBCDIC code page.
-- **UTF-8 / UTF-16 / UTF-32**: all languages contribute.
-- **US-ASCII**: English (`eng`) only.
-- **ISO-8859-3**: Maltese (`mlt`) — the only charset for which ħ, ġ, ċ, ż are
- representable without a Windows equivalent.
+37 direct model labels covering CJK multibyte, Unicode (UTF-8/16/32),
+EBCDIC variants (IBM500/1047/424/420), DOS code pages (IBM850/852/855/866),
+Cyrillic (KOI8-R/U, windows-1251, x-mac-cyrillic), Arabic (windows-1256),
+Thai (TIS-620), Vietnamese (windows-1258), and all major Windows-12XX /
+ISO-8859-X families. ISO-2022-JP/KR/CN and HZ-GB-2312 are detected via
+structural rules before the model runs.
diff --git a/tika-ml/tika-ml-chardetect/src/main/java/org/apache/tika/ml/chardetect/tools/BuildCharsetTrainingData.java b/tika-ml/tika-ml-chardetect/src/main/java/org/apache/tika/ml/chardetect/tools/BuildCharsetTrainingData.java
index 3c54343489..81f8f79636 100644
--- a/tika-ml/tika-ml-chardetect/src/main/java/org/apache/tika/ml/chardetect/tools/BuildCharsetTrainingData.java
+++ b/tika-ml/tika-ml-chardetect/src/main/java/org/apache/tika/ml/chardetect/tools/BuildCharsetTrainingData.java
@@ -275,6 +275,19 @@ public class BuildCharsetTrainingData {
"US-ASCII", "ISO-2022-JP", "ISO-2022-KR", "ISO-2022-CN",
"x-ISO-2022-CN-CNS"
));
+ /**
+ * Charsets that must not appear in train, devtest, or test splits either
+ * because they are confusable aliases for a trained label (IBM437 → IBM850)
+ * or because they are structural-only charsets whose test data was generated
+ * before the structural-only category was established (x-ISO-2022-CN-CNS).
+ * The eval tool mirrors this set via {@code DEFAULT_EXCLUDE} so neither
+ * charset produces misleading 0% strict rows.
+ */
+ private static final Set<String> CONFUSABLE_ALIAS = new HashSet<>(Arrays.asList(
+         "IBM437",            // box-drawing bytes never appear in prose; IBM850 is the trained label
+         "x-ISO-2022-CN-CNS"  // structural-only; detected by ISO-2022 escape gates, not the model
+ ));
+
/**
* Charsets exempt from the high-byte ratio gate. UTF-16/32 have a
* variable mix of zero and non-zero bytes depending on script; applying
diff --git a/tika-ml/tika-ml-chardetect/src/main/java/org/apache/tika/ml/chardetect/tools/EvalCharsetDetectors.java b/tika-ml/tika-ml-chardetect/src/main/java/org/apache/tika/ml/chardetect/tools/EvalCharsetDetectors.java
index 5f5bbb06b5..9d101ef146 100644
--- a/tika-ml/tika-ml-chardetect/src/main/java/org/apache/tika/ml/chardetect/tools/EvalCharsetDetectors.java
+++ b/tika-ml/tika-ml-chardetect/src/main/java/org/apache/tika/ml/chardetect/tools/EvalCharsetDetectors.java
@@ -27,6 +27,7 @@ import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.HashMap;
+import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
@@ -84,12 +85,24 @@ public class EvalCharsetDetectors {
// Index of the "All" detector — used for confusion matrix and score-only output
private static final int IDX_ALL = 3;
+ /**
+ * Charsets present in the test directory but not as direct model labels —
+ * either confusable aliases for a trained label or structural-only charsets
+ * whose test data was generated before the structural-only distinction was
+ * introduced. Skipped in per-row reporting to avoid misleading 0% numbers.
+ */
+ private static final Set<String> DEFAULT_EXCLUDE = Set.of(
+         "IBM437",            // superset IBM850 is the trained label; IBM437 is a confusable alias
+         "x-ISO-2022-CN-CNS"  // structural-only (ISO-2022 escape gates); no ML model label
+ );
+
public static void main(String[] args) throws Exception {
Path modelPath = null;
Path dataDir = null;
int[] probeLengths = {FULL_LENGTH};
boolean showConfusion = false;
boolean scoreOnly = false;
+ Set<String> exclude = new HashSet<>(DEFAULT_EXCLUDE);
for (int i = 0; i < args.length; i++) {
switch (args[i]) {
@@ -108,6 +121,9 @@ public class EvalCharsetDetectors {
case "--score-only":
scoreOnly = true;
break;
+ case "--exclude":
+ exclude.addAll(Arrays.asList(args[++i].split(",")));
+ break;
default:
System.err.println("Unknown argument: " + args[i]);
System.exit(1);
@@ -116,7 +132,7 @@ public class EvalCharsetDetectors {
if (dataDir == null) {
System.err.println(
"Usage: EvalCharsetDetectors [--model <path>] --data <dir>"
- + " [--lengths 20,50,100,full] [--confusion]");
+ + " [--lengths 20,50,100,full] [--confusion] [--exclude cs1,cs2]");
System.exit(1);
}
@@ -148,6 +164,9 @@ public class EvalCharsetDetectors {
List<List<byte[]>> allSamplesPerCharset = new ArrayList<>();
for (Path f : testFiles) {
String cs = f.getFileName().toString().replaceAll("\\.bin\\.gz$", "");
+ if (exclude.contains(cs)) {
+ continue;
+ }
List<byte[]> samples = loadSamples(f);
if (!samples.isEmpty()) {
charsets.add(cs);
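[Editor's note] The `--exclude` change above boils down to: recover the charset name from each `<charset>.bin.gz` test file and skip names in the exclude set before loading samples. A self-contained sketch of that filter (hypothetical class name; the real loop lives in `EvalCharsetDetectors.main`):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ExcludeFilterSketch {

    // Mirrors the eval tool's default exclusions (see DEFAULT_EXCLUDE in the patch).
    static final Set<String> DEFAULT_EXCLUDE = Set.of("IBM437", "x-ISO-2022-CN-CNS");

    // Strip the .bin.gz suffix to recover the charset name, as the eval loop does.
    static String charsetFromFileName(String fileName) {
        return fileName.replaceAll("\\.bin\\.gz$", "");
    }

    // Keep only charsets that are not excluded; order of the input is preserved.
    static List<String> keptCharsets(List<String> fileNames, Set<String> exclude) {
        return fileNames.stream()
                .map(ExcludeFilterSketch::charsetFromFileName)
                .filter(cs -> !exclude.contains(cs))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> files = List.of("UTF-8.bin.gz", "IBM437.bin.gz", "windows-1251.bin.gz");
        System.out.println(keptCharsets(files, DEFAULT_EXCLUDE));
        // prints "[UTF-8, windows-1251]"
    }
}
```

Users can widen the set at the command line (`--exclude cs1,cs2` appends to the defaults via `exclude.addAll(...)` in the patch), which is useful when a test directory contains charsets the model was deliberately not trained on.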