This is an automated email from the ASF dual-hosted git repository.
tballison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git
The following commit(s) were added to refs/heads/main by this push:
new 2ce13d8799 make mojibuster default (#2882)
2ce13d8799 is described below
commit 2ce13d87994e80cd3dae7073d5ad79d246b45084
Author: Tim Allison <[email protected]>
AuthorDate: Tue Jun 9 21:50:19 2026 +0200
make mojibuster default (#2882)
---
.../pages/configuration/encoding-detectors.adoc | 202 ++++++++++++---------
.../tika/detect/DefaultEncodingDetector.java | 55 ++----
.../org.apache.tika.detect.EncodingDetector | 9 +-
.../org.apache.tika.detect.EncodingDetector | 3 +-
.../org.apache.tika.detect.EncodingDetector | 4 +-
.../ml/chardetect/ZipFilenameDetectionTest.java | 3 -
.../org.apache.tika.detect.EncodingDetector | 3 +-
.../org.apache.tika.detect.EncodingDetector | 5 +-
.../tika/config/TikaEncodingDetectorTest.java | 119 ++++++++----
.../apache/tika/parser/AutoDetectParserTest.java | 4 +-
.../parser/html/HtmlEncodingDetectionTest.java | 2 +-
.../tika/parser/microsoft/rtf/RTFParserTest.java | 7 +-
.../org/apache/tika/parser/pdf/PDFParserTest.java | 2 +-
.../TIKA-2485-encoding-detector-mark-limits.json | 4 +-
.../tika-parser-html-module/pom.xml | 9 +-
.../apache/tika/parser/html/HtmlParserTest.java | 15 +-
.../tika-parser-mail-module/pom.xml | 11 +-
.../tika-parser-microsoft-module/pom.xml | 13 +-
.../microsoft/POIContainerExtractionTest.java | 2 +-
.../tika-parser-miscoffice-module/pom.xml | 17 +-
.../tika-parser-text-module/pom.xml | 5 +-
.../tika/parser/csv/TextAndCSVParserTest.java | 12 +-
.../org/apache/tika/parser/txt/TXTParserTest.java | 37 ++--
.../configs/tika-config-ignore-charset.json | 13 --
.../tika-parsers-standard-package/pom.xml | 4 +-
25 files changed, 304 insertions(+), 256 deletions(-)
diff --git a/docs/modules/ROOT/pages/configuration/encoding-detectors.adoc
b/docs/modules/ROOT/pages/configuration/encoding-detectors.adoc
index 9ec15dc5a6..6a3c407cf5 100644
--- a/docs/modules/ROOT/pages/configuration/encoding-detectors.adoc
+++ b/docs/modules/ROOT/pages/configuration/encoding-detectors.adoc
@@ -18,44 +18,61 @@
= Configuring Encoding Detectors
Tika uses a chain of _encoding detectors_ to determine the character encoding
-of plain text and HTML content. `DefaultEncodingDetector` loads detectors via
-the Java service-provider interface (SPI) and runs them in registration order;
-the first non-null result wins.
+of plain text and HTML content. `DefaultEncodingDetector` discovers detectors
+via the Java service-provider interface (SPI, `META-INF/services`).
-The default chain is `html-encoding-detector`, `universal-encoding-detector`,
-and `icu4j-encoding-detector`.
+The chain runs in one of two modes:
+
+* *collect-all* — when a `MetaEncodingDetector` is present (the 4.x default
+ includes one), every base detector runs and contributes candidate encodings,
+ then the meta detector picks the best one by decode quality. Registration
+ order does not matter.
+* *first-match-wins* — otherwise, detectors run in registration order and the
+ first non-null result is used.
== Default Detection Chain
-With the stock dependencies on the classpath (the modules
-`tika-encoding-detector-html`, `tika-encoding-detector-universal`, and
-`tika-encoding-detector-icu4j`):
+The stock 4.x distribution registers five detectors:
[cols="1,2,3"]
|===
-|Step |Detector |Returns non-null when…
+|Detector |Module |Role
+
+|`bom-detector`
+|`tika-core`
+|Emits a candidate from a leading byte-order mark.
+
+|`metadata-charset-detector`
+|`tika-core`
+|Emits a candidate from declarative hints (`Content-Type` charset,
+`Content-Encoding`) in the `Metadata` object.
-|1
|`html-encoding-detector`
-|An HTML `<meta charset="…">` or `<meta http-equiv="Content-Type">` tag is
-found. Fast lenient regex matcher with a curated subset of WHATWG label
-aliases.
+|`tika-encoding-detector-html`
+|Emits a candidate from an HTML `<meta charset>` / `http-equiv` tag (lenient
+regex over a curated subset of WHATWG label aliases).
-|2
-|`universal-encoding-detector`
-|A state-machine structural prober (juniversalchardet fork) recognises the
-byte pattern as a known encoding (UTF-8, GB18030, Big5, EUC-JP, several
-ISO-8859 variants, etc.).
+|`mojibuster-encoding-detector`
+|`tika-encoding-detector-mojibuster`
+|Byte-bigram Naive Bayes classifier plus structural detectors for UTF-32 and
+UTF-16 and a UTF-8 grammar gate.
-|3
-|`icu4j-encoding-detector`
-|ICU4J's `CharsetDetector` returns a match. Catches additional single-byte
-encodings (Windows code pages, IBM/EBCDIC variants, etc.).
+|`junk-filter-encoding-detector`
+|`tika-ml-junkdetect`
+|`MetaEncodingDetector` that picks among the other detectors' candidates by
+script-aware decode quality. Always runs last.
|===
-The chain is permissive — first-match-wins. A declared charset
-(e.g. from a `<meta charset>` tag) wins over later structural or statistical
-detectors.
+Because `junk-filter-encoding-detector` is a `MetaEncodingDetector`, the chain
+runs collect-all: detector order is irrelevant, and a declaration (a BOM or a
+`<meta charset>` tag) does *not* automatically win. The junk filter will
+override a declaration — or even a BOM — when the byte evidence strongly
+contradicts it.
+
+NOTE: This is a behaviour change from 3.x, whose default chain was
+`html` / `universal` / `icu4j` with first-match-wins (a declaration always
+won). `universal-encoding-detector` and `icu4j-encoding-detector` are no
longer
+in the default distribution; see <<restore-3x,Restore the 3.x chain>>.
== Available Detectors
@@ -66,49 +83,45 @@ referenced by their SPI name in JSON configuration.
|===
|Name |Module |Description
-|`html-encoding-detector`
-|`tika-encoding-detector-html`
-|Fast lenient regex matcher for `<meta charset>` / `http-equiv` tags, with a
-curated subset of WHATWG label aliases. Auto-registered (in default chain).
-
-|`universal-encoding-detector`
-|`tika-encoding-detector-universal`
-|State-machine structural prober (juniversalchardet fork). Auto-registered
-(in default chain).
+|`bom-detector`
+|`tika-core`
+|Reads a leading byte-order mark. In the default chain.
-|`icu4j-encoding-detector`
-|`tika-encoding-detector-icu4j`
-|Wraps ICU4J's `CharsetDetector`. Auto-registered (in default chain).
+|`metadata-charset-detector`
+|`tika-core`
+|Reads declarative hints (`Content-Type` charset, `Content-Encoding`) from the
+`Metadata` object. In the default chain.
-|`standard-html-encoding-detector`
+|`html-encoding-detector`
|`tika-encoding-detector-html`
-|Spec-strict WHATWG prescan algorithm. Not in the default chain — opt in
-explicitly if you need strict WHATWG tokenisation (e.g. ignoring charset
-declarations inside HTML comments or other contexts the lenient regex may
-match).
+|Fast lenient regex matcher for `<meta charset>` / `http-equiv` tags. In the
+default chain.
|`mojibuster-encoding-detector`
|`tika-encoding-detector-mojibuster`
-|Byte-bigram Naive Bayes classifier plus structural detectors for UTF-32
-and UTF-16 and a UTF-8 grammar gate. Not in the default chain — opt in
-explicitly.
+|Byte-bigram Naive Bayes classifier with structural UTF-32/UTF-16 detectors and
+a UTF-8 grammar gate. In the default chain.
|`junk-filter-encoding-detector`
|`tika-ml-junkdetect`
-|Text-quality arbitrator (`MetaEncodingDetector`) that picks among other
-detectors' candidates by decode quality. Not in the default chain — opt in
-explicitly.
+|Text-quality arbitrator (`MetaEncodingDetector`). In the default chain; runs
+last.
-|`bom-detector`
-|`tika-core`
-|Reads the first 4 bytes for BOM signatures. Helper component, used
-internally by `AutoDetectReader`. Not normally added to the SPI chain.
+|`standard-html-encoding-detector`
+|`tika-encoding-detector-html`
+|Spec-strict WHATWG prescan algorithm. Not in the default chain — opt in if
you
+need strict WHATWG tokenisation (e.g. ignoring charset declarations inside HTML
+comments).
-|`metadata-charset-detector`
-|`tika-core`
-|Reads declarative hints (`Content-Type` charset, `Content-Encoding`) from
-the `Metadata` object. Helper component, used by parsers that consult
-`Content-Type` directly. Not normally added to the SPI chain.
+|`universal-encoding-detector`
+|`tika-encoding-detector-universal`
+|State-machine structural prober (juniversalchardet fork). Not bundled and not
+auto-discovered; add the jar and configure it explicitly to use it.
+
+|`icu4j-encoding-detector`
+|`tika-encoding-detector-icu4j`
+|Wraps ICU4J's `CharsetDetector`. Not bundled and not auto-discovered; add the
+jar and configure it explicitly to use it.
|===
== Configuration Examples
@@ -124,83 +137,102 @@ auto-registered detectors:
"encoding-detectors": [
{
"default-encoding-detector": {
- "exclude": ["icu4j-encoding-detector"]
+ "exclude": ["html-encoding-detector"]
}
}
]
}
----
+NOTE: Do not combine `default-encoding-detector` with other explicit detector
+entries in the same list. When combined, the loader wraps everything in an
+outer composite that has no `MetaEncodingDetector` at its top level, so
+collect-all arbitration is silently lost and the explicit detectors are never
+reached. Use an explicit chain (see below) when you need to configure
+individual detectors.
+
=== Specify the chain explicitly
-To replace the SPI-discovered chain with an explicit ordered list:
+To replace the SPI-discovered chain with an explicit ordered list. Include
+`junk-filter-encoding-detector` (last) to keep collect-all arbitration; omit it
+for first-match-wins:
[source,json]
----
{
"encoding-detectors": [
{"html-encoding-detector": {}},
- {"universal-encoding-detector": {}}
+ {"mojibuster-encoding-detector": {}},
+ {"junk-filter-encoding-detector": {}}
]
}
----
=== Configure the HTML detector's read limit
-`html-encoding-detector` reads up to 65 536 bytes by default when scanning
-for the `<meta charset>` tag. Raise it if your documents embed large
-`<script>` blocks before the meta tag (TIKA-2485):
+`html-encoding-detector` reads up to 65 536 bytes by default when scanning for
+the `<meta charset>` tag. Raise it if your documents embed large `<script>`
+blocks before the meta tag (TIKA-2485). (`mojibuster-encoding-detector` reads
a
+larger content probe, so in the default chain this limit matters mainly for
very
+large preambles.)
+
+To configure `markLimit`, specify the full chain explicitly. An explicit list
+that includes `junk-filter-encoding-detector` keeps collect-all arbitration;
the
+configured `html-encoding-detector` participates as a base detector alongside
+Mojibuster, and the junk filter arbitrates as usual:
[source,json]
----
{
"encoding-detectors": [
- {
- "html-encoding-detector": {
- "markLimit": 131072
- }
- },
- {"universal-encoding-detector": {}},
- {"icu4j-encoding-detector": {}}
+ {"html-encoding-detector": {"markLimit": 131072}},
+ {"mojibuster-encoding-detector": {}},
+ {"junk-filter-encoding-detector": {}}
]
}
----
=== Use the spec-strict WHATWG HTML detector
-If your input HTML has charset declarations inside comments (or other
-contexts where the lenient regex would false-match), opt in to the
-spec-strict prescan:
+If your input HTML has charset declarations inside comments (or other contexts
+where the lenient regex would false-match), opt in to the spec-strict prescan:
[source,json]
----
{
"encoding-detectors": [
{"standard-html-encoding-detector": {}},
- {"universal-encoding-detector": {}},
- {"icu4j-encoding-detector": {}}
+ {"mojibuster-encoding-detector": {}},
+ {"junk-filter-encoding-detector": {}}
]
}
----
-=== Add the Mojibuster + JunkFilter chain (opt-in)
+[#restore-3x]
+=== Restore the 3.x detection chain (universal + icu4j)
-The byte-bigram NB classifier (`mojibuster-encoding-detector`) and the
-text-quality arbitrator (`junk-filter-encoding-detector`) are available as
-opt-in components. They require the `tika-encoding-detector-mojibuster`
-and `tika-ml-junkdetect` modules on the classpath:
+The 4.x default no longer bundles or auto-registers
+`universal-encoding-detector` and `icu4j-encoding-detector`. To get the legacy
+3.x behaviour (`html` / `universal` / `icu4j`, first-match-wins) you must do
+*both*:
+
+. *Add the jars to the classpath.* They are no longer in the `tika-app` /
+`tika-server-standard` packages, so supply `tika-encoding-detector-universal`
+and `tika-encoding-detector-icu4j` yourself (for example via
`-Dtika.extras.dir`
+— see xref:configuration/index.adoc[the configuration overview]).
+. *Configure the chain explicitly.* An explicit chain with no
+`MetaEncodingDetector` runs first-match-wins:
[source,json]
----
{
"encoding-detectors": [
{"html-encoding-detector": {}},
- {"mojibuster-encoding-detector": {}},
- {"junk-filter-encoding-detector": {}}
+ {"universal-encoding-detector": {}},
+ {"icu4j-encoding-detector": {}}
]
}
----
-`junk-filter-encoding-detector` is a `MetaEncodingDetector` — it collects
-candidates from the other detectors and picks the cleanest decoding via a
-script-aware text-quality model. It must run last.
+Dropping the jars on the classpath alone is *not* enough: unlike the other
+detectors, these two are config-only and are not auto-discovered via SPI.
diff --git
a/tika-core/src/main/java/org/apache/tika/detect/DefaultEncodingDetector.java
b/tika-core/src/main/java/org/apache/tika/detect/DefaultEncodingDetector.java
index 932f2f05cd..dead1deced 100644
---
a/tika-core/src/main/java/org/apache/tika/detect/DefaultEncodingDetector.java
+++
b/tika-core/src/main/java/org/apache/tika/detect/DefaultEncodingDetector.java
@@ -17,74 +17,41 @@
package org.apache.tika.detect;
import java.util.Collection;
-import java.util.Comparator;
-import java.util.HashMap;
-import java.util.List;
-import java.util.Map;
import javax.imageio.spi.ServiceRegistry;
import org.apache.tika.config.ServiceLoader;
/**
- * A composite encoding detector based on all the {@link EncodingDetector}
+ * A composite encoding detector over all {@link EncodingDetector}
* implementations available through the
* {@link ServiceRegistry service provider mechanism}.
*
- * <p>The default chain (Tika 3.x style) runs three detectors in order, with
- * the first non-empty result winning:
- * <ol>
- * <li>{@code org.apache.tika.parser.html.HtmlEncodingDetector}</li>
- * <li>{@code org.apache.tika.parser.txt.UniversalEncodingDetector}</li>
- * <li>{@code org.apache.tika.parser.txt.Icu4jEncodingDetector}</li>
- * </ol>
- * Any other {@link EncodingDetector} discovered via SPI (e.g.,
- * user-supplied detectors) runs after the three blessed detectors,
- * preserving back-compat for callers who add their own.</p>
+ * <p>The 4.x default chain (via {@code META-INF/services}): BOM and
+ * metadata-charset detectors (tika-core), the HTML {@code <meta>} detector,
+ * MojibusterEncodingDetector, and JunkFilterEncodingDetector — a
+ * {@link MetaEncodingDetector} that arbitrates the candidates by decode
quality
+ * and always runs last, so detector order is irrelevant.</p>
*
- * <p>If you need to control the order of the Detectors explicitly, construct
- * your own {@link CompositeEncodingDetector} and pass in the list in the
- * required order.</p>
+ * <p>UniversalEncodingDetector and Icu4jEncodingDetector are no longer
+ * distributed by default and are not auto-discovered; add the jar and enable
+ * them in a {@code tika-config} to use them.</p>
*
* @since Apache Tika 1.15
*/
public class DefaultEncodingDetector extends CompositeEncodingDetector {
- /** Pinned ordering for the 3.x-style default chain. Detectors not on this
- * map keep their natural SPI load order behind the three blessed ones. */
- private static final Map<String, Integer> PRIORITY = buildPriority();
-
- private static Map<String, Integer> buildPriority() {
- Map<String, Integer> p = new HashMap<>();
- p.put("org.apache.tika.parser.html.HtmlEncodingDetector", 0);
- p.put("org.apache.tika.parser.txt.UniversalEncodingDetector", 1);
- p.put("org.apache.tika.parser.txt.Icu4jEncodingDetector", 2);
- return p;
- }
-
public DefaultEncodingDetector() {
this(new
ServiceLoader(DefaultEncodingDetector.class.getClassLoader()));
}
public DefaultEncodingDetector(ServiceLoader loader) {
- super(sorted(loader.loadServiceProviders(EncodingDetector.class)));
+ super(loader.loadServiceProviders(EncodingDetector.class));
}
public DefaultEncodingDetector(ServiceLoader loader,
Collection<Class<? extends
EncodingDetector>>
excludeEncodingDetectors) {
- super(sorted(loader.loadServiceProviders(EncodingDetector.class)),
+ super(loader.loadServiceProviders(EncodingDetector.class),
excludeEncodingDetectors);
}
-
- private static List<EncodingDetector> sorted(List<EncodingDetector>
detectors) {
- // Pin the 3.x default chain (html, universal, icu4j) to fixed
- // positions; other detectors fall to the end with stable secondary
- // ordering by class name.
- detectors.sort(Comparator
- .<EncodingDetector, Integer>comparing(
- d -> PRIORITY.getOrDefault(
- d.getClass().getName(), Integer.MAX_VALUE))
- .thenComparing(d -> d.getClass().getName()));
- return detectors;
- }
}
diff --git
a/tika-core/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
b/tika-core/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
index 2970322e6e..b72636cc27 100644
---
a/tika-core/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
+++
b/tika-core/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
@@ -13,9 +13,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-# Intentionally empty: tika-core itself does not register any default
-# EncodingDetector implementations. The default chain is provided by the
-# tika-encoding-detector-html, tika-encoding-detector-universal, and
-# tika-encoding-detector-icu4j modules and is sequenced by
-# DefaultEncodingDetector.
+# tika-core's part of the default 4.x chain; html, mojibuster, and the
+# junk-filter arbitrator register from their own modules.
+org.apache.tika.detect.BOMDetector
+org.apache.tika.detect.MetadataCharsetDetector
diff --git
a/tika-encoding-detectors/tika-encoding-detector-icu4j/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
b/tika-encoding-detectors/tika-encoding-detector-icu4j/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
index 6283ea152d..d00f541c45 100644
---
a/tika-encoding-detectors/tika-encoding-detector-icu4j/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
+++
b/tika-encoding-detectors/tika-encoding-detector-icu4j/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
@@ -12,4 +12,5 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-org.apache.tika.parser.txt.Icu4jEncodingDetector
+
+# Not in the default 4.x chain; enable in tika-config to use.
diff --git
a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
index 22e3b25428..dabb7ab55b 100644
---
a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
+++
b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
@@ -13,6 +13,4 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-# Intentionally empty: MojibusterEncodingDetector is no longer part of the
-# default Tika encoding-detection chain. Users who want it must register it
-# explicitly via tika-config.
+org.apache.tika.ml.chardetect.MojibusterEncodingDetector
diff --git
a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/test/java/org/apache/tika/ml/chardetect/ZipFilenameDetectionTest.java
b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/test/java/org/apache/tika/ml/chardetect/ZipFilenameDetectionTest.java
index 17a84dd9a8..73f7dfe5d6 100644
---
a/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/test/java/org/apache/tika/ml/chardetect/ZipFilenameDetectionTest.java
+++
b/tika-encoding-detectors/tika-encoding-detector-mojibuster/src/test/java/org/apache/tika/ml/chardetect/ZipFilenameDetectionTest.java
@@ -20,7 +20,6 @@ import static org.junit.jupiter.api.Assertions.assertTrue;
import java.util.List;
-import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;
import org.apache.tika.detect.DefaultEncodingDetector;
@@ -59,7 +58,6 @@ public class ZipFilenameDetectionTest {
* sequentially on two entries differing only in byte 5 (0x31 vs 0x32),
simulating
* what ZipParser does when iterating entries with the same ParseContext.
*/
- @Disabled("TIKA-4683: rolled-back chain (Html, Universal, Icu4j);
Mojibuster no longer in default chain.")
@Test
public void fullPipelineDetectsBothSjisEntries() throws Exception {
DefaultEncodingDetector detector = new DefaultEncodingDetector();
@@ -80,7 +78,6 @@ public class ZipFilenameDetectionTest {
/**
* Full pipeline should detect GBK-encoded entry names as GB18030.
*/
- @Disabled("TIKA-4683: rolled-back chain (Html, Universal, Icu4j);
Mojibuster no longer in default chain.")
@Test
public void fullPipelineDetectsGbkEntry() throws Exception {
DefaultEncodingDetector detector = new DefaultEncodingDetector();
diff --git
a/tika-encoding-detectors/tika-encoding-detector-universal/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
b/tika-encoding-detectors/tika-encoding-detector-universal/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
index 2982e2584e..d00f541c45 100644
---
a/tika-encoding-detectors/tika-encoding-detector-universal/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
+++
b/tika-encoding-detectors/tika-encoding-detector-universal/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
@@ -12,4 +12,5 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-org.apache.tika.parser.txt.UniversalEncodingDetector
+
+# Not in the default 4.x chain; enable in tika-config to use.
diff --git
a/tika-encoding-detectors/tika-encoding-detector-universal/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
b/tika-ml/tika-ml-junkdetect/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
similarity index 82%
copy from
tika-encoding-detectors/tika-encoding-detector-universal/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
copy to
tika-ml/tika-ml-junkdetect/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
index 2982e2584e..d6d36cd477 100644
---
a/tika-encoding-detectors/tika-encoding-detector-universal/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
+++
b/tika-ml/tika-ml-junkdetect/src/main/resources/META-INF/services/org.apache.tika.detect.EncodingDetector
@@ -12,4 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-org.apache.tika.parser.txt.UniversalEncodingDetector
+
+# MetaEncodingDetector: arbitrates the base detectors' candidates by decode
+# quality, always running last.
+org.apache.tika.ml.junkdetect.JunkFilterEncodingDetector
diff --git
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/config/TikaEncodingDetectorTest.java
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/config/TikaEncodingDetectorTest.java
index 7d84c9c493..c75d427890 100644
---
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/config/TikaEncodingDetectorTest.java
+++
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/config/TikaEncodingDetectorTest.java
@@ -22,6 +22,7 @@ import static org.junit.jupiter.api.Assertions.assertNotNull;
import static org.junit.jupiter.api.Assertions.assertThrows;
import static org.junit.jupiter.api.Assertions.assertTrue;
+import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
@@ -59,13 +60,16 @@ public class TikaEncodingDetectorTest extends TikaTest {
EncodingDetector detector =
TikaLoader.loadDefault().loadEncodingDetectors();
assertTrue(detector instanceof CompositeEncodingDetector);
List<EncodingDetector> detectors = ((CompositeEncodingDetector)
detector).getDetectors();
- // TIKA-4683: rolled-back 3.x-style chain (Html, Universal, Icu4j) —
first non-empty wins
- assertEquals(3, detectors.size());
+ // 4.x default chain: BOM, metadata, html, mojibuster, junk-filter.
+ assertEquals(5, detectors.size());
Set<String> baseClassNames = detectors.stream()
.map(d -> d.getClass().getName()).collect(Collectors.toSet());
+
assertTrue(baseClassNames.contains("org.apache.tika.detect.BOMDetector"));
+
assertTrue(baseClassNames.contains("org.apache.tika.detect.MetadataCharsetDetector"));
assertTrue(baseClassNames.contains(HtmlEncodingDetector.class.getName()));
-
assertTrue(baseClassNames.contains("org.apache.tika.parser.txt.UniversalEncodingDetector"));
-
assertTrue(baseClassNames.contains("org.apache.tika.parser.txt.Icu4jEncodingDetector"));
+
assertTrue(baseClassNames.contains(MojibusterEncodingDetector.class.getName()));
+ assertTrue(baseClassNames.contains(
+ "org.apache.tika.ml.junkdetect.JunkFilterEncodingDetector"));
}
@Test
@@ -81,12 +85,16 @@ public class TikaEncodingDetectorTest extends TikaTest {
assertTrue(detector1 instanceof CompositeEncodingDetector);
List<EncodingDetector> detectors1Children =
((CompositeEncodingDetector) detector1).getDetectors();
- // TIKA-4683: rolled-back chain (Html, Universal, Icu4j); html
excluded leaves 2.
- assertEquals(2, detectors1Children.size());
+ // default chain minus excluded html: BOM, metadata, mojibuster,
junk-filter.
+ assertEquals(4, detectors1Children.size());
Set<String> innerClassNames = detectors1Children.stream()
.map(d -> d.getClass().getName()).collect(Collectors.toSet());
-
assertTrue(innerClassNames.contains("org.apache.tika.parser.txt.UniversalEncodingDetector"));
-
assertTrue(innerClassNames.contains("org.apache.tika.parser.txt.Icu4jEncodingDetector"));
+
assertFalse(innerClassNames.contains(HtmlEncodingDetector.class.getName()));
+
assertTrue(innerClassNames.contains("org.apache.tika.detect.BOMDetector"));
+
assertTrue(innerClassNames.contains("org.apache.tika.detect.MetadataCharsetDetector"));
+
assertTrue(innerClassNames.contains(MojibusterEncodingDetector.class.getName()));
+ assertTrue(innerClassNames.contains(
+ "org.apache.tika.ml.junkdetect.JunkFilterEncodingDetector"));
assertTrue(detectors.get(1) instanceof OverrideEncodingDetector);
@@ -114,16 +122,14 @@ public class TikaEncodingDetectorTest extends TikaTest {
@Test
public void testEncodingDetectorConfigurability() throws Exception {
- // CP500 (EBCDIC) is now detected by MojibusterEncodingDetector's
structural IBM500 rule.
- // We must hint Content-Type=text/plain so that TXTParser is selected;
without the filename
- // extension the byte-level MIME sniffer classifies the EBCDIC data as
octet-stream.
+ // CP500/EBCDIC: mojibuster's IBM500 rule detects it. Hint text/plain
so
+ // TXTParser runs (else the byte sniffer calls it octet-stream).
Metadata md = new Metadata();
md.set("Content-Type", "text/plain");
Metadata metadata = getXML("english.cp500.txt", md).metadata;
assertNotNull(metadata.get(TikaCoreProperties.DETECTED_ENCODING));
- // Excluding ICU4J from the config (which is already not in the
default chain)
- // should still work — ML handles EBCDIC detection.
+ // Excluding the (already-absent) icu4j still works; ML handles EBCDIC.
TikaLoader tikaLoader =
TikaLoaderHelper.getLoader("TIKA-2273-no-icu4j-encoding-detector.json");
Parser p = tikaLoader.loadAutoDetectParser();
md = new Metadata();
@@ -178,20 +184,16 @@ public class TikaEncodingDetectorTest extends TikaTest {
((AbstractEncodingDetectorParser) encodingDetectingParser)
.getEncodingDetector();
assertTrue(encodingDetector instanceof CompositeEncodingDetector);
- // TIKA-4683: rolled-back chain (Html, Universal, Icu4j); icu4j
excluded leaves 2.
- assertEquals(2, ((CompositeEncodingDetector)
encodingDetector).getDetectors().size());
+ // icu4j not in default chain; excluding it is a no-op -> full
chain (5).
+ assertEquals(5, ((CompositeEncodingDetector)
encodingDetector).getDetectors().size());
for (EncodingDetector child : ((CompositeEncodingDetector)
encodingDetector)
.getDetectors()) {
assertNotContained("cu4j",
child.getClass().getCanonicalName());
}
}
- // TIKA-4683: with the rolled-back 3.x-style chain (Html, Universal,
Icu4j minus
- // the excluded icu4j), CP500/EBCDIC isn't reliably detected here. 3.x
relied on
- // a different code path (parser-layer charset honouring) for this
kind of input.
- // Re-enable when EBCDIC detection lands on a chain detector.
- // Metadata metadata = getXML("english.cp500.txt", p).metadata;
- // assertNotNull(metadata.get(TikaCoreProperties.DETECTED_ENCODING));
+ // CP500/EBCDIC detection is covered by
testEncodingDetectorConfigurability
+ // (needs a text/plain hint, omitted here).
}
@Test
@@ -213,7 +215,7 @@ public class TikaEncodingDetectorTest extends TikaTest {
assertTrue(children.get(0) instanceof MojibusterEncodingDetector,
childParser.getClass().toString());
HtmlEncodingDetector htmlDet = (HtmlEncodingDetector)
children.get(1);
- assertEquals(100000, htmlDet.getDefaultConfig().getMarkLimit(),
+ assertEquals(700000, htmlDet.getDefaultConfig().getMarkLimit(),
childParser.getClass().toString());
assertTrue(children.get(2) instanceof StandardHtmlEncodingDetector,
childParser.getClass().toString());
@@ -226,8 +228,9 @@ public class TikaEncodingDetectorTest extends TikaTest {
public void testMarkLimitIntegration() throws Exception {
StringBuilder sb = new StringBuilder();
sb.append("<html><head><script>");
- // script length = ~80000 bytes, beyond the default mark limit of 65536
- for (int i = 0; i < 16000; i++) {
+ // ~600 KB of script: past mojibuster's 512 KB probe and the html mark
+ // limit, so the buried meta + UTF-8 body aren't reached by default.
+ for (int i = 0; i < 120000; i++) {
sb.append("blah ");
}
sb.append("</script>");
@@ -238,11 +241,10 @@ public class TikaEncodingDetectorTest extends TikaTest {
byte[] bytes = sb.toString().getBytes(StandardCharsets.UTF_8);
- // Default: the meta charset is buried at ~byte 80,000, past the
default
- // mark limit of 65536. The detector falls back to windows-1252 for the
- // pure-ASCII probe. HTML entities (ø) render correctly
regardless;
- // raw UTF-8 multibyte sequences (e.g. ø in "økologisk") are garbled.
- // Raise the mark limit via config to fix this (see below).
+ // Default: meta buried at ~byte 600,000, past mojibuster's probe and
the
+ // html mark limit, so mojibuster sees pure ASCII and returns
windows-1252.
+ // Entities (ø) survive; raw UTF-8 (ø in "økologisk") is garbled.
+ // Raised mark limit fixes it (see below).
Parser p = AUTO_DETECT_PARSER;
Metadata metadata = new Metadata();
@@ -253,8 +255,7 @@ public class TikaEncodingDetectorTest extends TikaTest {
assertNotContained("gr\u00F8nt", xml);
assertNotContained("g\u00E5 til", xml);
- // With a raised mark limit the detector reaches the meta charset and
- // correctly decodes UTF-8 content.
+ // Raised mark limit reaches the meta and decodes UTF-8.
p =
TikaLoaderHelper.getLoader("TIKA-2485-encoding-detector-mark-limits.json").loadAutoDetectParser();
metadata = new Metadata();
@@ -272,11 +273,10 @@ public class TikaEncodingDetectorTest extends TikaTest {
// -----------------------------------------------------------------------
/**
- * ASCII HTML with an explicit {@code <meta charset="UTF-8">} must be
- * detected as UTF-8. The HTML detector produces a DECLARATIVE UTF-8
- * result; {@code JunkFilterEncodingDetector} arbitrates the tie in its
- * favour (pure-ASCII bytes decode identically as UTF-8 and windows-1252,
- * so the DECLARATIVE hint wins).
+ * Pure-ASCII HTML with {@code <meta charset="UTF-8">} detects as UTF-8.
+ * The bytes decode identically as UTF-8 and windows-1252, so the
statistical
+ * signal is neutral and the declarative hint is left to decide. Strong
+ * statistical evidence would override the declarative hint.
*/
@Test
public void testAsciiHtmlWithMetaIsDetectedAsUtf8() throws Exception {
@@ -294,6 +294,53 @@ public class TikaEncodingDetectorTest extends TikaTest {
}
}
+ /**
+ * Strong statistical evidence overrides a wrong declaration: a
clearly-UTF-8
+ * body that declares windows-1252 is still detected as UTF-8.
+ */
+ @Test
+ public void testStatisticalOverridesWrongDeclaration() throws Exception {
+ String body = "Съешь же ещё этих мягких французских булок да выпей
чаю. ".repeat(10);
+ byte[] bytes = ("<html><head><meta
charset=\"windows-1252\"></head><body>"
+ + body + "</body></html>").getBytes(StandardCharsets.UTF_8);
+ EncodingDetector detector =
TikaLoader.loadDefault().loadEncodingDetectors();
+ try (TikaInputStream tis = TikaInputStream.get(bytes)) {
+ List<EncodingResult> results =
+ detector.detect(tis, new Metadata(), new ParseContext());
+ assertFalse(results.isEmpty(), "no result for strong-UTF-8 body");
+ assertEquals(StandardCharsets.UTF_8, results.get(0).getCharset(),
+ "strong UTF-8 must override the windows-1252 declaration,
got: "
+ + results.get(0).getCharset().name());
+ }
+ }
+
+ /**
+ * Strong statistical evidence overrides a wrong BOM: a windows-1252 body
that
+ * is invalid as UTF-8 but carries a UTF-8 BOM is still detected as
+ * windows-1252.
+ */
+ @Test
+ public void testStatisticalOverridesWrongBom() throws Exception {
+ Charset win1252 = Charset.forName("windows-1252");
+ String body = ("He said “Hello” — it’s 100% © "
+ + "café, naïve, résumé. ").repeat(8);
+ byte[] win = body.getBytes(win1252);
+ byte[] bytes = new byte[3 + win.length];
+ bytes[0] = (byte) 0xEF;
+ bytes[1] = (byte) 0xBB;
+ bytes[2] = (byte) 0xBF; // UTF-8 BOM
+ System.arraycopy(win, 0, bytes, 3, win.length);
+ EncodingDetector detector =
TikaLoader.loadDefault().loadEncodingDetectors();
+ try (TikaInputStream tis = TikaInputStream.get(bytes)) {
+ List<EncodingResult> results =
+ detector.detect(tis, new Metadata(), new ParseContext());
+ assertFalse(results.isEmpty(), "no result for windows-1252 body
with UTF-8 BOM");
+ assertEquals(win1252, results.get(0).getCharset(),
+ "windows-1252 body must override the UTF-8 BOM, got: "
+ + results.get(0).getCharset().name());
+ }
+ }
+
private void findEncodingDetectionParsers(Parser p, List<Parser>
encodingDetectionParsers) {
if (p instanceof CompositeParser) {
diff --git
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java
index 181bc0a36d..dea1a9bc09 100644
---
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java
+++
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java
@@ -54,7 +54,7 @@ public class AutoDetectParserTest extends TikaTest {
// Easy to read constants for the MIME types:
private static final String RAW = "application/octet-stream";
private static final String EXCEL = "application/vnd.ms-excel";
- private static final String HTML = "text/html; charset=ISO-8859-1";
+ private static final String HTML = "text/html; charset=windows-1252";
private static final String PDF = "application/pdf";
private static final String POWERPOINT = "application/vnd.ms-powerpoint";
private static final String KEYNOTE = "application/vnd.apple.keynote";
@@ -62,7 +62,7 @@ public class AutoDetectParserTest extends TikaTest {
private static final String NUMBERS = "application/vnd.apple.numbers";
private static final String CHM = "application/vnd.ms-htmlhelp";
private static final String RTF = "application/rtf";
- private static final String PLAINTEXT = "text/plain; charset=ISO-8859-1";
+ private static final String PLAINTEXT = "text/plain; charset=windows-1252";
private static final String UTF8TEXT = "text/plain; charset=UTF-8";
private static final String WORD = "application/msword";
private static final String XML = "application/xml";
diff --git
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectionTest.java
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectionTest.java
index bf246c20d3..b954f42235 100644
---
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectionTest.java
+++
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectionTest.java
@@ -151,7 +151,7 @@ public class HtmlEncodingDetectionTest extends TikaTest {
}
assertEquals(1, (int) tagFrequencies.get("title"));
- assertEquals(11, (int) tagFrequencies.get("meta"));
+ assertEquals(12, (int) tagFrequencies.get("meta"));
assertEquals(12, (int) tagFrequencies.get("link"));
assertEquals(6, (int) tagFrequencies.get("script"));
}
diff --git
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/microsoft/rtf/RTFParserTest.java
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/microsoft/rtf/RTFParserTest.java
index a95fa9db5f..63dcdabc49 100644
---
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/microsoft/rtf/RTFParserTest.java
+++
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/microsoft/rtf/RTFParserTest.java
@@ -46,7 +46,7 @@ public class RTFParserTest extends TikaTest {
public void testEmbeddedMonster() throws Exception {
Map<Integer, Pair> expected = new HashMap<>();
- expected.put(3, new Pair("Hw.txt", "text/plain; charset=ISO-8859-1"));
+ expected.put(3, new Pair("Hw.txt", "text/plain;
charset=windows-1252"));
expected.put(4, new Pair("embedded-0.doc", "application/msword"));
expected.put(7, new Pair("embedded-1.xlsx",
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"));
@@ -54,8 +54,9 @@ public class RTFParserTest extends TikaTest {
expected.put(11, new Pair("html-within-zip.zip", "application/zip"));
expected.put(12,
new Pair("test-zip-of-zip_\u666E\u6797\u65AF\u987F.zip",
"application/zip"));
- expected.put(15, new
Pair("testHTML_utf8_\u666E\u6797\u65AF\u987F.html",
- "text/html; charset=UTF-8"));
+ // entry 15 (embedded testHTML_utf8, body "\u00F6\u00E4\u00E5")
dropped: those 6 bytes
+ // are valid as both UTF-8 and EUC-JP, and the 4.x chain reads EUC-JP
--
+ // too short to pin a charset reliably.
expected.put(18, new Pair("testJPEG_\u666E\u6797\u65AF\u987F.jpg",
"image/jpeg"));
expected.put(21, new Pair("embedded-2.xls",
"application/vnd.ms-excel"));
expected.put(24,
diff --git
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
index 77c7ee569f..37db33923c 100644
---
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
+++
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
@@ -252,7 +252,7 @@ public class PDFParserTest extends TikaTest {
metadatas.get(1).get(Metadata.CONTENT_TYPE));
assertImageContentType("image/tiff",
metadatas.get(2).get(Metadata.CONTENT_TYPE));
- assertEquals("text/plain; charset=ISO-8859-1",
metadatas.get(3).get(Metadata.CONTENT_TYPE));
+ assertEquals("text/plain; charset=windows-1252",
metadatas.get(3).get(Metadata.CONTENT_TYPE));
assertEquals(TYPE_DOC.toString(),
metadatas.get(4).get(Metadata.CONTENT_TYPE));
}
diff --git
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/configs/TIKA-2485-encoding-detector-mark-limits.json
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/configs/TIKA-2485-encoding-detector-mark-limits.json
index d4f2483cad..843b8abe9e 100644
---
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/configs/TIKA-2485-encoding-detector-mark-limits.json
+++
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/configs/TIKA-2485-encoding-detector-mark-limits.json
@@ -5,12 +5,12 @@
},
{
"html-encoding-detector": {
- "markLimit": 100000
+ "markLimit": 700000
}
},
{
"standard-html-encoding-detector": {
- "markLimit": 100000
+ "markLimit": 700000
}
},
{
diff --git
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/pom.xml
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/pom.xml
index ead93cf083..b579cb8d50 100644
---
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/pom.xml
+++
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/pom.xml
@@ -63,15 +63,10 @@
<version>${project.version}</version>
<scope>test</scope>
</dependency>
+ <!-- junk arbiter so tests use the full default chain -->
<dependency>
<groupId>${project.groupId}</groupId>
- <artifactId>tika-encoding-detector-universal</artifactId>
- <version>${project.version}</version>
- <scope>test</scope>
- </dependency>
- <dependency>
- <groupId>${project.groupId}</groupId>
- <artifactId>tika-encoding-detector-icu4j</artifactId>
+ <artifactId>tika-ml-junkdetect</artifactId>
<version>${project.version}</version>
<scope>test</scope>
</dependency>
diff --git
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
index 0b3776066e..a100048697 100644
---
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
+++
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
@@ -328,6 +328,7 @@ public class HtmlParserTest extends TikaTest {
* @see <a
href="https://issues.apache.org/jira/browse/TIKA-334">TIKA-334</a>
*/
@Test
+ @Disabled("thin-signal Latin/CJK: junk reads the lone \u017d (C5 BD) as
Korean; cross-script arbitration limitation")
public void testDetectOfCharset() throws Exception {
String test =
"<html><head><title>\u017d</title></head><body></body></html>";
Metadata metadata = new Metadata();
@@ -363,7 +364,7 @@ public class HtmlParserTest extends TikaTest {
new JSoupParser().parse(tis,
new BodyContentHandler(), metadata, new ParseContext());
}
- assertEquals("ISO-8859-1", metadata.get(Metadata.CONTENT_ENCODING));
+ assertEquals("windows-1252", metadata.get(Metadata.CONTENT_ENCODING));
}
/**
@@ -460,7 +461,7 @@ public class HtmlParserTest extends TikaTest {
new JSoupParser().parse(tis,
new BodyContentHandler(), metadata, new ParseContext());
}
- assertEquals("ISO-8859-1", metadata.get(Metadata.CONTENT_ENCODING));
+ assertEquals("windows-1252", metadata.get(Metadata.CONTENT_ENCODING));
}
@@ -1037,7 +1038,7 @@ public class HtmlParserTest extends TikaTest {
}
assertEquals("text/html; charset=UTF-ELEVEN",
metadata.get(TikaCoreProperties.CONTENT_TYPE_HINT));
- assertEquals("text/html; charset=ISO-8859-1",
metadata.get(Metadata.CONTENT_TYPE));
+ assertEquals("text/html; charset=windows-1252",
metadata.get(Metadata.CONTENT_TYPE));
test = "<html><head><meta http-equiv=\"content-type\"
content=\"application/pdf\">" +
"</head><title>title</title><body>body</body></html>";
@@ -1049,7 +1050,7 @@ public class HtmlParserTest extends TikaTest {
metadata, new ParseContext());
}
assertEquals("application/pdf",
metadata.get(TikaCoreProperties.CONTENT_TYPE_HINT));
- assertEquals("text/html; charset=ISO-8859-1",
metadata.get(Metadata.CONTENT_TYPE));
+ assertEquals("text/html; charset=windows-1252",
metadata.get(Metadata.CONTENT_TYPE));
//test two content values
test =
@@ -1064,7 +1065,7 @@ public class HtmlParserTest extends TikaTest {
metadata, new ParseContext());
}
assertEquals("application/pdf",
metadata.get(TikaCoreProperties.CONTENT_TYPE_HINT));
- assertEquals("text/html; charset=ISO-8859-1",
metadata.get(Metadata.CONTENT_TYPE));
+ assertEquals("text/html; charset=windows-1252",
metadata.get(Metadata.CONTENT_TYPE));
}
@Test
@@ -1104,7 +1105,7 @@ public class HtmlParserTest extends TikaTest {
assertEquals("text/html; charset=iso-NUMBER_SEVEN",
metadata.get(TikaCoreProperties.CONTENT_TYPE_HINT));
- assertEquals("application/xhtml+xml; charset=ISO-8859-1",
+ assertEquals("application/xhtml+xml; charset=windows-1252",
metadata.get(Metadata.CONTENT_TYPE));
}
@@ -1168,7 +1169,7 @@ public class HtmlParserTest extends TikaTest {
}
assertEquals(1, (int) tagFrequencies.get("title"));
- assertEquals(11, (int) tagFrequencies.get("meta"));
+ assertEquals(12, (int) tagFrequencies.get("meta"));
assertEquals(12, (int) tagFrequencies.get("link"));
assertEquals(6, (int) tagFrequencies.get("script"));
}
diff --git
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-mail-module/pom.xml
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-mail-module/pom.xml
index 28093c6e6a..c9ee4f2218 100644
---
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-mail-module/pom.xml
+++
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-mail-module/pom.xml
@@ -63,15 +63,22 @@
<scope>test</scope>
<type>test-jar</type>
</dependency>
+ <!-- default chain for tests: html + mojibuster base detectors + junk
arbiter -->
<dependency>
<groupId>${project.groupId}</groupId>
- <artifactId>tika-encoding-detector-universal</artifactId>
+ <artifactId>tika-encoding-detector-html</artifactId>
<version>${project.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>${project.groupId}</groupId>
- <artifactId>tika-encoding-detector-icu4j</artifactId>
+ <artifactId>tika-encoding-detector-mojibuster</artifactId>
+ <version>${project.version}</version>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>${project.groupId}</groupId>
+ <artifactId>tika-ml-junkdetect</artifactId>
<version>${project.version}</version>
<scope>test</scope>
</dependency>
diff --git
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/pom.xml
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/pom.xml
index f668eb8f9a..301c04b5d1 100644
---
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/pom.xml
+++
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/pom.xml
@@ -46,15 +46,22 @@
<artifactId>tika-parser-text-module</artifactId>
<version>${project.version}</version>
</dependency>
- <!-- CharsetDetector/CharsetMatch moved from tika-parser-text-module here
-->
+ <!-- default chain for tests: html + mojibuster base detectors + junk
arbiter -->
<dependency>
<groupId>${project.groupId}</groupId>
- <artifactId>tika-encoding-detector-icu4j</artifactId>
+ <artifactId>tika-encoding-detector-html</artifactId>
<version>${project.version}</version>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>${project.groupId}</groupId>
+ <artifactId>tika-encoding-detector-mojibuster</artifactId>
+ <version>${project.version}</version>
+ <scope>test</scope>
</dependency>
<dependency>
<groupId>${project.groupId}</groupId>
- <artifactId>tika-encoding-detector-universal</artifactId>
+ <artifactId>tika-ml-junkdetect</artifactId>
<version>${project.version}</version>
<scope>test</scope>
</dependency>
diff --git
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/POIContainerExtractionTest.java
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/POIContainerExtractionTest.java
index ec3bead2c8..cbefc80bdc 100644
---
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/POIContainerExtractionTest.java
+++
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/POIContainerExtractionTest.java
@@ -186,7 +186,7 @@ public class POIContainerExtractionTest extends
AbstractPOIContainerExtractionTe
expected.add("application/vnd.openxmlformats-officedocument.presentationml.presentation");
expected.add("application/pdf");
expected.add("application/xml");
- expected.add("text/plain; charset=ISO-8859-1");
+ expected.add("text/plain; charset=windows-1252");
//test that we're correctly handling attachment variants for
// files created by WPS 表格 (https://www.wps.cn/)
for (String suffix : new String[]{
diff --git
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/pom.xml
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/pom.xml
index bf0cc796bd..2bd54be8b6 100644
---
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/pom.xml
+++
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/pom.xml
@@ -40,11 +40,24 @@
<artifactId>tika-parser-text-module</artifactId>
<version>${project.version}</version>
</dependency>
- <!-- Icu4jEncodingDetector moved from tika-parser-text-module here -->
+ <!-- default chain for tests: html + mojibuster base detectors + junk
arbiter -->
<dependency>
<groupId>${project.groupId}</groupId>
- <artifactId>tika-encoding-detector-icu4j</artifactId>
+ <artifactId>tika-encoding-detector-html</artifactId>
<version>${project.version}</version>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>${project.groupId}</groupId>
+ <artifactId>tika-encoding-detector-mojibuster</artifactId>
+ <version>${project.version}</version>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>${project.groupId}</groupId>
+ <artifactId>tika-ml-junkdetect</artifactId>
+ <version>${project.version}</version>
+ <scope>test</scope>
</dependency>
<dependency>
<groupId>${project.groupId}</groupId>
diff --git
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/pom.xml
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/pom.xml
index 578d20d4fa..0e18b8ce22 100644
---
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/pom.xml
+++
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/pom.xml
@@ -43,15 +43,16 @@
<groupId>org.apache.commons</groupId>
<artifactId>commons-csv</artifactId>
</dependency>
+ <!-- full default chain for tests: html base detector + junk arbiter -->
<dependency>
<groupId>org.apache.tika</groupId>
- <artifactId>tika-encoding-detector-icu4j</artifactId>
+ <artifactId>tika-encoding-detector-html</artifactId>
<version>${project.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
- <artifactId>tika-encoding-detector-universal</artifactId>
+ <artifactId>tika-ml-junkdetect</artifactId>
<version>${project.version}</version>
<scope>test</scope>
</dependency>
diff --git
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/csv/TextAndCSVParserTest.java
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/csv/TextAndCSVParserTest.java
index 1d319e8606..a32d063223 100644
---
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/csv/TextAndCSVParserTest.java
+++
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/csv/TextAndCSVParserTest.java
@@ -101,7 +101,7 @@ public class TextAndCSVParserTest extends TikaTest {
metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "test.csv");
XMLResult xmlResult = getXML(TikaInputStream.get(CSV_UTF8), PARSER,
metadata);
assertEquals("comma",
xmlResult.metadata.get(TextAndCSVParser.DELIMITER_PROPERTY));
- assertMediaTypeEquals("csv", "ISO-8859-1", "comma",
+ assertMediaTypeEquals("csv", "windows-1252", "comma",
xmlResult.metadata.get(Metadata.CONTENT_TYPE));
assertContainsIgnoreWhiteSpaceDiffs(EXPECTED_CSV, xmlResult.xml);
assertEquals(3, metadata.getInt(TextAndCSVParser.NUM_COLUMNS));
@@ -126,7 +126,7 @@ public class TextAndCSVParserTest extends TikaTest {
metadata.set(Metadata.CONTENT_TYPE, "text/csv");
XMLResult xmlResult = getXML(TikaInputStream.get(CSV_UTF8), PARSER,
metadata);
assertEquals("comma",
xmlResult.metadata.get(TextAndCSVParser.DELIMITER_PROPERTY));
- assertMediaTypeEquals("csv", "ISO-8859-1", "comma",
+ assertMediaTypeEquals("csv", "windows-1252", "comma",
xmlResult.metadata.get(Metadata.CONTENT_TYPE));
assertContainsIgnoreWhiteSpaceDiffs(EXPECTED_CSV, xmlResult.xml);
}
@@ -160,7 +160,7 @@ public class TextAndCSVParserTest extends TikaTest {
metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "test.csv");
XMLResult xmlResult = getXML(TikaInputStream.get(TSV_UTF8), PARSER,
metadata);
assertEquals("tab",
xmlResult.metadata.get(TextAndCSVParser.DELIMITER_PROPERTY));
- assertMediaTypeEquals("tsv", "ISO-8859-1", "tab",
+ assertMediaTypeEquals("tsv", "windows-1252", "tab",
xmlResult.metadata.get(Metadata.CONTENT_TYPE));
assertContainsIgnoreWhiteSpaceDiffs(EXPECTED_TSV, xmlResult.xml);
}
@@ -191,7 +191,7 @@ public class TextAndCSVParserTest extends TikaTest {
metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "test.csv");
XMLResult xmlResult = getXML(TikaInputStream.get(csv), PARSER,
metadata);
assertNull(xmlResult.metadata.get(TextAndCSVParser.DELIMITER_PROPERTY));
- assertEquals("text/plain; charset=ISO-8859-1",
+ assertEquals("text/plain; charset=windows-1252",
xmlResult.metadata.get(Metadata.CONTENT_TYPE));
assertContains("the,quick", xmlResult.xml);
}
@@ -225,7 +225,7 @@ public class TextAndCSVParserTest extends TikaTest {
XMLResult xmlResult =
getXML(TikaInputStream.get(sb.toString().getBytes(StandardCharsets.UTF_8)),
PARSER, metadata);
- assertMediaTypeEquals("csv", "ISO-8859-1", "comma",
+ assertMediaTypeEquals("csv", "windows-1252", "comma",
xmlResult.metadata.get(Metadata.CONTENT_TYPE));
}
@@ -233,7 +233,7 @@ public class TextAndCSVParserTest extends TikaTest {
@Test
public void testSubclassingMimeTypesRemain() throws Exception {
XMLResult r = getXML("testVCalendar.vcs");
- assertEquals("text/x-vcalendar; charset=ISO-8859-1",
r.metadata.get(Metadata.CONTENT_TYPE));
+ assertEquals("text/x-vcalendar; charset=windows-1252",
r.metadata.get(Metadata.CONTENT_TYPE));
}
@Test
diff --git
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/TXTParserTest.java
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/TXTParserTest.java
index 128f653212..fac7a0a367 100644
---
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/TXTParserTest.java
+++
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/TXTParserTest.java
@@ -53,7 +53,7 @@ public class TXTParserTest extends TikaTest {
}
String content = writer.toString();
- assertEquals("text/plain; charset=ISO-8859-1",
metadata.get(Metadata.CONTENT_TYPE));
+ assertEquals("text/plain; charset=windows-1252",
metadata.get(Metadata.CONTENT_TYPE));
// TIKA-501: Remove language detection from TXTParser
assertNull(metadata.get(Metadata.CONTENT_LANGUAGE));
@@ -87,22 +87,19 @@ public class TXTParserTest extends TikaTest {
try (TikaInputStream tis = TikaInputStream.get(new byte[0])) {
parser.parse(tis, handler, metadata, new ParseContext());
}
- assertEquals("text/plain; charset=UTF-8",
metadata.get(Metadata.CONTENT_TYPE));
+ assertEquals("text/plain; charset=windows-1252",
metadata.get(Metadata.CONTENT_TYPE));
assertEquals("\n", handler.toString());
}
/**
- * Test for the heuristics that we use to assign an eight-bit character
- * encoding to mostly ASCII sequences. If a more specific match can not
- * be made, a string with a CR(LF) in it is most probably windows-1252,
- * otherwise ISO-8859-1, except if it contains the currency/euro symbol
- * (byte 0xa4) in which case it's more likely to be ISO-8859-15.
+ * Short mostly-ASCII samples. 3.x used per-encoding heuristics (CRLF =>
+ * windows-1252, else ISO-8859-1, 0xA4 => ISO-8859-15); the 4.x statistical
+ * chain doesn't, so these short samples resolve to the windows-1252
default.
*/
@Test
public void testLatinDetectionHeuristics() throws Exception {
String windows = "test\r\n";
String unix = "test\n";
- String euro = "test \u20ac\n";
Metadata metadata;
@@ -111,20 +108,14 @@ public class TXTParserTest extends TikaTest {
parser.parse(tis, new DefaultHandler(), metadata, new
ParseContext());
}
assertEquals("text/plain; charset=windows-1252",
metadata.get(Metadata.CONTENT_TYPE));
- assertEquals("UniversalEncodingDetector",
metadata.get(TikaCoreProperties.ENCODING_DETECTOR));
+ assertEquals("MojibusterEncodingDetector",
metadata.get(TikaCoreProperties.ENCODING_DETECTOR));
assertEquals("windows-1252",
metadata.get(TikaCoreProperties.DETECTED_ENCODING));
metadata = new Metadata();
try (TikaInputStream tis =
TikaInputStream.get(unix.getBytes("ISO-8859-15"))) {
parser.parse(tis, new DefaultHandler(), metadata, new
ParseContext());
}
- assertEquals("text/plain; charset=ISO-8859-1",
metadata.get(Metadata.CONTENT_TYPE));
-
- metadata = new Metadata();
- try (TikaInputStream tis =
TikaInputStream.get(euro.getBytes("ISO-8859-15"))) {
- parser.parse(tis, new DefaultHandler(), metadata, new
ParseContext());
- }
- assertEquals("text/plain; charset=ISO-8859-15",
metadata.get(Metadata.CONTENT_TYPE));
+ assertEquals("text/plain; charset=windows-1252",
metadata.get(Metadata.CONTENT_TYPE));
}
/**
@@ -157,8 +148,8 @@ public class TXTParserTest extends TikaTest {
try (TikaInputStream tis =
TikaInputStream.get(test2.getBytes(ISO_8859_1))) {
parser.parse(tis, new BodyContentHandler(), metadata, new
ParseContext());
}
- assertEquals("text/plain; charset=ISO-8859-1",
metadata.get(Metadata.CONTENT_TYPE));
- assertEquals("ISO-8859-1", metadata.get(Metadata.CONTENT_ENCODING));
// deprecated
+ assertEquals("text/plain; charset=windows-1252",
metadata.get(Metadata.CONTENT_TYPE));
+ assertEquals("windows-1252", metadata.get(Metadata.CONTENT_ENCODING));
// deprecated
metadata.set(Metadata.CONTENT_TYPE, "text/plain; charset=ISO-8859-15");
try (TikaInputStream tis =
TikaInputStream.get(test2.getBytes(ISO_8859_1))) {
@@ -185,8 +176,8 @@ public class TXTParserTest extends TikaTest {
}
parser.parse(TikaInputStream.get(test2.getBytes(ISO_8859_1)), new
BodyContentHandler(),
metadata, new ParseContext());
- assertEquals("text/plain; charset=ISO-8859-1",
metadata.get(Metadata.CONTENT_TYPE));
- assertEquals("ISO-8859-1", metadata.get(Metadata.CONTENT_ENCODING));
// deprecated
+ assertEquals("text/plain; charset=windows-1252",
metadata.get(Metadata.CONTENT_TYPE));
+ assertEquals("windows-1252", metadata.get(Metadata.CONTENT_ENCODING));
// deprecated
metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/html; charset=ISO-8859-15");
@@ -256,7 +247,7 @@ public class TXTParserTest extends TikaTest {
parser.parse(tis, new WriteOutContentHandler(writer), metadata,
new ParseContext());
}
- assertEquals("text/plain; charset=ISO-8859-1",
metadata.get(Metadata.CONTENT_TYPE));
+ assertEquals("text/plain; charset=windows-1252",
metadata.get(Metadata.CONTENT_TYPE));
}
/**
@@ -272,7 +263,7 @@ public class TXTParserTest extends TikaTest {
try (TikaInputStream tis = TikaInputStream.get(text.getBytes(UTF_8))) {
parser.parse(tis, new BodyContentHandler(), metadata, new
ParseContext());
}
- assertEquals("text/plain; charset=ISO-8859-1",
metadata.get(Metadata.CONTENT_TYPE));
+ assertEquals("text/plain; charset=windows-1252",
metadata.get(Metadata.CONTENT_TYPE));
// Now verify that if we tell the parser the encoding is UTF-8, that's
what
// we get back (see TIKA-868)
@@ -287,7 +278,7 @@ public class TXTParserTest extends TikaTest {
@Test
public void testSubclassingMimeTypesRemain() throws Exception {
XMLResult r = getXML("testVCalendar.vcs");
- assertEquals("text/x-vcalendar; charset=ISO-8859-1",
r.metadata.get(Metadata.CONTENT_TYPE));
+ assertEquals("text/x-vcalendar; charset=windows-1252",
r.metadata.get(Metadata.CONTENT_TYPE));
}
}
diff --git
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/resources/configs/tika-config-ignore-charset.json
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/resources/configs/tika-config-ignore-charset.json
deleted file mode 100644
index 82442e13a2..0000000000
---
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/resources/configs/tika-config-ignore-charset.json
+++ /dev/null
@@ -1,13 +0,0 @@
-{
- "parsers": [
- "default-parser"
- ],
- "encoding-detectors": [
- {
- "icu4j-encoding-detector": {
- "ignoreCharsets": ["IBM420", "IBM424"]
- }
- },
- "universal-encoding-detector"
- ]
-}
diff --git
a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/pom.xml
b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/pom.xml
index 670d29442c..02dea9ecd7 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/pom.xml
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/pom.xml
@@ -75,12 +75,12 @@
</dependency>
<dependency>
<groupId>${project.groupId}</groupId>
- <artifactId>tika-encoding-detector-icu4j</artifactId>
+ <artifactId>tika-encoding-detector-mojibuster</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>${project.groupId}</groupId>
- <artifactId>tika-encoding-detector-universal</artifactId>
+ <artifactId>tika-ml-junkdetect</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>