[ 
https://issues.apache.org/jira/browse/TIKA-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick C updated TIKA-2421:
-------------------------
    Attachment: test.html

Attached is an example file that produces a run of Chinese characters when
parsed with Tika. See also
[this W3C note|https://www.w3.org/International/questions/qa-html-encoding-declarations#utf16]:
the HTML5 specification forbids using the meta element to declare UTF-16.
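
The Chinese characters are consistent with ASCII bytes being decoded as
UTF-16: every two ASCII bytes are paired into a single 16-bit code unit, and
most of those land in the CJK ranges. A minimal sketch of that effect follows;
the class and method names are hypothetical, and big-endian byte order is an
assumption, not something confirmed for the attached file.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Decode ASCII bytes as UTF-16BE (assumed byte order): each pair of
    // bytes becomes one 16-bit code unit.
    static String misdecode(String asciiText) {
        byte[] bytes = asciiText.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // "<html>" is the bytes 3C 68 74 6D 6C 3E, i.e. three code units.
        for (char c : misdecode("<html>").toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The six bytes of "<html>" become the three code units U+3C68, U+746D and
U+6C3E, all of which are Han characters.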

> HTML Encoding Detector should ignore UTF-16 and UTF-32
> ------------------------------------------------------
>
>                 Key: TIKA-2421
>                 URL: https://issues.apache.org/jira/browse/TIKA-2421
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Nick C
>            Priority: Minor
>         Attachments: test.html
>
>
> HTMLEncodingDetector decodes the document head as ASCII when it scans the
> meta tag for a declared encoding. It should ignore pages whose meta tag
> declares UTF-16 or UTF-32: if the tag could be read as ASCII/UTF-8, the page
> obviously cannot be encoded in UTF-16 or UTF-32.
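
The requested behaviour could be sketched roughly as below. This is not
Tika's HTMLEncodingDetector code: the class name, method name and regular
expression are hypothetical and only illustrate discarding a meta declaration
that cannot be true.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSketch {
    // Hypothetical, simplified pattern for charset=... inside a <meta> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_:.-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name in upper case, or null when nothing
    // usable is declared.
    static String declaredCharset(byte[] prefix) {
        // The head is read as ASCII, so a tag found this way implies the
        // bytes are ASCII-compatible.
        String head = new String(prefix, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(head);
        if (!m.find()) {
            return null;
        }
        String name = m.group(1).toUpperCase(Locale.ROOT);
        // A UTF-16 or UTF-32 page could not have been matched as ASCII,
        // so such a declaration is ignored.
        if (name.startsWith("UTF-16") || name.startsWith("UTF-32")) {
            return null;
        }
        return name;
    }
}
```

With this sketch a page declaring charset="utf-16" yields no result, so a
caller would fall through to whatever detection it runs next, while an
ASCII-compatible declaration such as ISO-8859-1 is returned unchanged.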



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
