Sebastian Nagel created TIKA-3612:
-------------------------------------

             Summary: Update StandardHtmlEncodingDetector to follow the living standard
                 Key: TIKA-3612
                 URL: https://issues.apache.org/jira/browse/TIKA-3612
             Project: Tika
          Issue Type: Improvement
          Components: detector
    Affects Versions: 2.1.0
            Reporter: Sebastian Nagel


[StandardHtmlEncodingDetector|https://github.com/apache/tika/blob/71f7e50cbd8d18a2bf269e240593e4a398b9f8ee/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardHtmlEncodingDetector.java#L84]
 uses three heuristics to detect the encoding of an HTML document:
 # BOM
 # Content-Type HTTP header
 # HTML <meta> tag
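
The three heuristics listed above can be sketched as a simple fallback chain. This is a minimal illustration, not the code of StandardHtmlEncodingDetector: the class name, the method names, the regular expressions, and the assumption that the heuristics are tried in the listed order are all hypothetical. It also uses plain Java charset names rather than the label mapping defined by the WHATWG Encoding standard.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not Tika's actual implementation.
class EncodingChainSketch {

    // Simplified pattern: "charset=<label>" inside a Content-Type value.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    // Simplified pattern: a charset declaration inside a <meta> tag.
    private static final Pattern META = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Tries BOM, then Content-Type header, then <meta> tag (assumed order). */
    static Optional<Charset> detect(byte[] bytes, String contentType) {
        Optional<Charset> result = fromBom(bytes);
        if (result.isPresent()) {
            return result;
        }
        result = fromContentType(contentType);
        if (result.isPresent()) {
            return result;
        }
        return fromMeta(bytes);
    }

    /** Heuristic 1: byte order mark at the start of the stream. */
    static Optional<Charset> fromBom(byte[] b) {
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return Optional.of(StandardCharsets.UTF_8);
        if (startsWith(b, 0xFE, 0xFF)) return Optional.of(StandardCharsets.UTF_16BE);
        if (startsWith(b, 0xFF, 0xFE)) return Optional.of(StandardCharsets.UTF_16LE);
        return Optional.empty();
    }

    /** Heuristic 2: charset parameter of the Content-Type HTTP header. */
    static Optional<Charset> fromContentType(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET.matcher(header);
        return m.find() ? forLabel(m.group(1)) : Optional.empty();
    }

    /** Heuristic 3: <meta> declaration within the first 1024 bytes. */
    static Optional<Charset> fromMeta(byte[] b) {
        int n = Math.min(b.length, 1024);
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives decoding.
        String head = new String(b, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META.matcher(head);
        return m.find() ? forLabel(m.group(1)) : Optional.empty();
    }

    private static Optional<Charset> forLabel(String label) {
        try {
            return Optional.of(Charset.forName(label));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return Optional.empty();
        }
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

With this sketch, a UTF-8 BOM takes precedence over a conflicting Content-Type header, and an empty result means that none of the three heuristics produced an answer.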

The ["living standard", 13.2.3.2 Determining the character 
encoding|https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding]
 has evolved since the detector was written and is now based on a longer chain 
of steps, including char/byte statistics ("The user agent may attempt to 
autodetect the character encoding from applying frequency analysis or other 
algorithms to the data stream.") and, definitely useful, a list of fall-back 
encodings based on the content language, to be used when the document is not 
encoded in one of the UTF encodings.
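
The language-based fall-back could look roughly like the sketch below. The class and method names are hypothetical. The table holds only a few entries quoted from memory of the suggested defaults in the living standard, so they should be checked against the spec before use. Note also that the spec keys its table on the user's locale, so applying it to the content language, as proposed here, is an assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a language-based fall-back lookup.
class FallbackEncodingSketch {

    // Default for all languages not listed in the table.
    static final String DEFAULT_ENCODING = "windows-1252";

    // Partial table; entries recalled from the living standard, to be verified.
    private static final Map<String, String> FALLBACKS = new HashMap<>();

    static {
        FALLBACKS.put("ru", "windows-1251");
        FALLBACKS.put("ja", "Shift_JIS");
        FALLBACKS.put("ko", "EUC-KR");
        FALLBACKS.put("tr", "windows-1254");
    }

    /** Returns the fall-back encoding label for a BCP 47 tag such as "ru-RU". */
    static String fallbackFor(String languageTag) {
        if (languageTag == null || languageTag.isEmpty()) {
            return DEFAULT_ENCODING;
        }
        // Only the primary language subtag is used in this simplified lookup.
        String primary = Locale.forLanguageTag(languageTag).getLanguage();
        return FALLBACKS.getOrDefault(primary, DEFAULT_ENCODING);
    }
}
```

Such a lookup would only be consulted as the last step, after the BOM, header, <meta> and any statistical detection have failed to produce an encoding.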



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
