[jira] [Comment Edited] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

Gerard Bouchar (JIRA) Fri, 22 Jun 2018 05:05:43 -0700


    [ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520117#comment-16520117 ]


Gerard Bouchar edited comment on TIKA-2673 at 6/22/18 12:04 PM:
----------------------------------------------------------------

[[email protected]]: Thank you very much for integrating my tests and for 
your changes to HtmlEncodingDetector! I think that at least the utf16 test 
shouldn't be ignored. HtmlEncodingDetector does its detection by running 
regular expressions on the byte stream decoded as ASCII. If the file were 
actually in UTF-16 (an encoding that uses at least two bytes per character and 
is not compatible with ASCII), the regular expression would not have matched in 
the first place. So when the expression does match and the declared charset is 
UTF-16, the file cannot really be UTF-16, and decoding it as UTF-16 will almost 
certainly result in garbled text. [The 
specification|https://html.spec.whatwg.org/multipage/parsing.html#the-input-byte-stream]
 was written by people with experience in real-world misuses of character 
encodings on the web, so I think we can confidently trust it concerning the 
various edge cases.
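
The argument above can be illustrated with a minimal Java sketch. It is not 
Tika's actual code: the class name MetaCharsetSketch, the method detect and the 
simplified META_CHARSET pattern are assumptions made for illustration only. It 
decodes the leading bytes as ASCII, looks for a meta charset declaration, and, 
as the specification prescribes, replaces a declared UTF-16 encoding with UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSketch {
    // Hypothetical, simplified pattern (an assumption, not the one Tika uses).
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_:.-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset to use, or null if no usable declaration is found. */
    static Charset detect(byte[] prefix) {
        // Decode the raw bytes as ASCII, as described in the comment above.
        String head = new String(prefix, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(head);
        if (!m.find()) {
            return null; // real UTF-16 content ends up here: NUL bytes break the match
        }
        Charset declared;
        try {
            declared = Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            return null; // illegal or unsupported charset name
        }
        // The pattern matched on ASCII-decoded bytes, so the content cannot
        // really be UTF-16; the specification says to use UTF-8 instead.
        if (declared.name().startsWith("UTF-16")) {
            return StandardCharsets.UTF_8;
        }
        return declared;
    }

    public static void main(String[] args) {
        String html = "<meta charset=\"utf-16\">";
        // Tag stored in an ASCII-compatible encoding, wrongly declaring UTF-16.
        System.out.println(detect(html.getBytes(StandardCharsets.US_ASCII)));
        // Tag stored in genuine UTF-16: the ASCII-based pattern cannot see it.
        System.out.println(detect(html.getBytes(StandardCharsets.UTF_16)));
    }
}
```

With genuine UTF-16 input, every ASCII character is accompanied by a NUL byte, 
so the literal "<meta" never appears in the ASCII-decoded text and the sketch 
reports no declaration at all.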


was (Author: gbouchar):
[[email protected]] I think that at least the utf16 test shouldn't be ignored. 
HtmlEncodingDetector does its detection by running regular expressions on the 
byte stream decoded as ASCII. If the file were actually in UTF-16 (an encoding 
that uses at least two bytes per character and is not compatible with ASCII), 
the regular expression would not have matched in the first place. Decoding it 
as UTF-16 will almost certainly result in garbled text. [The 
specification|https://html.spec.whatwg.org/multipage/parsing.html#the-input-byte-stream]
 was written by people with experience in real-world misuses of character 
encodings on the web, so I think we can confidently trust it concerning the 
various edge cases.

> HtmlEncodingDetector doesn't follow the specification
> -----------------------------------------------------
>
>                 Key: TIKA-2673
>                 URL: https://issues.apache.org/jira/browse/TIKA-2673
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>         Attachments: HtmlEncodingDetectorTest.java, 
> StrictHtmlEncodingDetector.tar.gz
>
>
> This bug is linked to TIKA-2671, but it concerns the byte-based detection 
> itself rather than metadata.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails to 
> detect the right charset.
> I am attaching the test cases to this issue: 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
