Sebastian Nagel created TIKA-3612:
-------------------------------------
Summary: Update StandardHtmlEncodingDetector to follow the living
standard
Key: TIKA-3612
URL: https://issues.apache.org/jira/browse/TIKA-3612
Project: Tika
Issue Type: Improvement
Components: detector
Affects Versions: 2.1.0
Reporter: Sebastian Nagel
[StandardHtmlEncodingDetector|https://github.com/apache/tika/blob/71f7e50cbd8d18a2bf269e240593e4a398b9f8ee/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardHtmlEncodingDetector.java#L84]
uses three heuristics to detect the encoding of an HTML document:
# BOM
# Content-Type HTTP header
# HTML <meta> tag
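To illustrate the order of these three checks, here is a minimal sketch. It is not the actual StandardHtmlEncodingDetector code: the class name {{EncodingSniffSketch}}, its methods, and the simplified regular expressions are assumptions made for illustration. It resolves labels with {{Charset.forName}}, so it does not apply the WHATWG label mapping (for example, treating {{iso-8859-1}} as windows-1252), and it only scans the first 1024 bytes for a {{<meta>}} tag.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the BOM -> Content-Type header -> <meta> chain.
final class EncodingSniffSketch {

    // Simplified: picks up "charset=..." in a Content-Type header value.
    private static final Pattern HEADER_CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Simplified: picks up "charset=..." anywhere inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Step 1: byte order mark (UTF-8, UTF-16BE, UTF-16LE).
    static Charset fromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // Returns the charset named by the first match, or null if there is no
    // match or the JVM does not know the label.
    private static Charset lookup(Pattern pattern, String text) {
        if (text == null) {
            return null;
        }
        Matcher m = pattern.matcher(text);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            return null; // illegal or unsupported charset name
        }
    }

    // Applies the three checks in order; returns null if none of them matches.
    static Charset detect(byte[] content, String contentTypeHeader) {
        Charset cs = fromBom(content);
        if (cs != null) {
            return cs;
        }
        // Step 2: charset parameter of the HTTP Content-Type header.
        cs = lookup(HEADER_CHARSET, contentTypeHeader);
        if (cs != null) {
            return cs;
        }
        // Step 3: <meta> tag, scanned in an ASCII-compatible decoding.
        int n = Math.min(content.length, 1024);
        String head = new String(content, 0, n, StandardCharsets.ISO_8859_1);
        return lookup(META_CHARSET, head);
    }
}
```

In this sketch a {{null}} result means that none of the three heuristics produced an answer, which is the point where the additional steps of the living standard would apply.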
The ["living standard", 13.2.3.2 Determining the character
encoding|https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding]
has evolved since then and is now based on a longer chain of steps,
including byte/character statistics ("The user agent may attempt to autodetect the
character encoding from applying frequency analysis or other algorithms to the
data stream.") and, definitely useful, a list of fallback encodings chosen by
content language, for documents that are not encoded in one of the UTF
encodings.
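A language-based fallback could be sketched roughly as follows. The class {{FallbackEncodingSketch}} and the method {{fallbackFor}} are hypothetical, and the map holds only a small excerpt of the suggested defaults as I understand them from the standard (which keys its table by locale). The full table should be taken from the specification itself, not from this example.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: choose a legacy fallback encoding label from a
// BCP 47 language tag when no BOM, header or <meta> hint is available.
final class FallbackEncodingSketch {

    // Used for every language without an entry in the excerpt below.
    private static final String DEFAULT = "windows-1252";

    // Excerpt only; see the living standard for the complete list.
    private static final Map<String, String> BY_LANGUAGE = Map.of(
            "ru", "windows-1251",
            "ja", "Shift_JIS",
            "ko", "EUC-KR",
            "pl", "ISO-8859-2",
            "tr", "windows-1254");

    // languageTag is e.g. "ru", "ja-JP" or "en-US"; null or empty means unknown.
    static String fallbackFor(String languageTag) {
        if (languageTag == null || languageTag.isEmpty()) {
            return DEFAULT;
        }
        // Keep only the primary language subtag, e.g. "ja" from "ja-JP".
        String language = Locale.forLanguageTag(languageTag).getLanguage();
        return BY_LANGUAGE.getOrDefault(language, DEFAULT);
    }
}
```

The sketch returns encoding labels as strings instead of {{Charset}} instances because not every JVM is guaranteed to ship all legacy charsets.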
--
This message was sent by Atlassian Jira
(v8.20.1#820001)