[ https://issues.apache.org/jira/browse/TIKA-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick C updated TIKA-2421:
-------------------------
Attachment: test.html
Here is an example file that produces a bunch of Chinese characters when run
through Tika. I also found
[this W3C article|https://www.w3.org/International/questions/qa-html-encoding-declarations#utf16]:
the HTML5 specification actually forbids using the meta element to declare
UTF-16.
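
To illustrate where the Chinese characters likely come from, here is a minimal sketch, assuming the ASCII bytes of the page end up decoded as big-endian UTF-16. The class and method names are hypothetical and the input is a made-up head fragment, not the attached test.html. Each pair of ASCII bytes is fused into a single 16-bit code unit, and for typical markup those units land in the CJK ideograph blocks.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika.
final class Utf16MisreadDemo {

    // Encode the text as ASCII bytes, then (wrongly) decode them as UTF-16BE.
    static String misread(String asciiText) {
        byte[] bytes = asciiText.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // "<h" is 0x3C 0x68, which becomes the single code unit U+3C68.
        String garbled = misread("<html><head>");
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Twelve ASCII bytes collapse into six UTF-16 code units here, which matches the symptom of a page turning into a short run of ideographs.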
> HTML Encoding Detector should ignore UTF-16 and UTF-32
> ------------------------------------------------------
>
> Key: TIKA-2421
> URL: https://issues.apache.org/jira/browse/TIKA-2421
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Reporter: Nick C
> Priority: Minor
> Attachments: test.html
>
>
> HTMLEncodingDetector interprets the head as ASCII when parsing the meta tag
> for a possible encoding. It should ignore meta declarations of UTF-16 or
> UTF-32: if the meta tag could be read as ASCII/UTF-8, the page obviously
> cannot be in either of those encodings.
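
The behaviour requested above could be sketched roughly as follows. This is not Tika's actual HTMLEncodingDetector code: the class name, method name, and regular expression are assumptions made for illustration. The idea is to read the head as ASCII, pull out the declared charset, and treat any UTF-16 or UTF-32 declaration as if nothing had been declared.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not Tika's implementation.
final class MetaCharsetSketch {

    // Finds charset=NAME inside a meta tag, with the bytes read as ASCII.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_.:+\\-]*)",
            Pattern.CASE_INSENSITIVE);

    static Optional<Charset> declaredCharset(byte[] head) {
        String ascii = new String(head, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(ascii);
        if (!m.find()) {
            return Optional.empty();
        }
        String name = m.group(1);
        if (!Charset.isSupported(name)) {
            return Optional.empty();
        }
        Charset cs = Charset.forName(name);
        // Check the canonical name so aliases such as "utf16" are caught too.
        String canonical = cs.name().toUpperCase(Locale.ROOT);
        if (canonical.contains("UTF-16") || canonical.contains("UTF-32")) {
            // The tag was readable as ASCII, so this declaration cannot be true.
            return Optional.empty();
        }
        return Optional.of(cs);
    }
}
```

Returning an empty result rather than substituting UTF-8 leaves the decision to whatever detector runs next, which seems the safer choice for a sketch like this.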