[ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520117#comment-16520117 ]
Gerard Bouchar edited comment on TIKA-2673 at 6/22/18 12:06 PM:
----------------------------------------------------------------

[~talli...@apache.org]: Thank you very much for integrating my tests and for your changes to HtmlEncodingDetector!

I think at least the utf16 test shouldn't be ignored. HtmlEncodingDetector does its detection by running regular expressions on the byte stream decoded as ASCII. So if the file were actually in UTF-16 (an encoding that uses at least two bytes per character and is not compatible with ASCII), it wouldn't have matched the regular expression in the first place. A file whose meta tag was found this way and declares UTF-16 is therefore not really UTF-16, and decoding it as UTF-16 will almost certainly result in garbled text.

[The specification|https://html.spec.whatwg.org/multipage/parsing.html#the-input-byte-stream] was written by people with experience in real-world misuses of character encodings on the web, so I think we can confidently trust it concerning various edge cases such as this one.
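To illustrate the argument in the comment above, here is a minimal sketch. It is not Tika's actual code: the class name `MetaCharsetSketch`, the simplified regular expression, and both helper methods are assumptions made for this example. It shows that a meta charset pattern applied to ASCII-decoded bytes cannot match a file that is really UTF-16, and it applies the rule from the specification's prescan algorithm that a declared UTF-16 label is treated as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSketch {
    // Simplified pattern for illustration only; NOT the expression Tika uses.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset label declared in a meta tag, or null if none matches. */
    public static String declaredLabel(byte[] head) {
        // Decode the raw bytes as ASCII, as described in the comment.
        String ascii = new String(head, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(ascii);
        return m.find() ? m.group(1) : null;
    }

    /** Maps a declared label to a charset; a declared UTF-16 variant becomes UTF-8. */
    public static Charset resolve(String label) {
        if (label == null) {
            return null;
        }
        // Charset.forName throws for unknown labels; a real detector would handle that.
        Charset declared = Charset.forName(label);
        if (declared.name().startsWith("UTF-16")) {
            // The tag was only found because the bytes are ASCII-compatible,
            // so the document cannot really be UTF-16.
            return StandardCharsets.UTF_8;
        }
        return declared;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta charset=\"utf-16\"></head></html>";

        // ASCII-compatible bytes: the pattern finds the declared label.
        byte[] asciiCompatible = html.getBytes(StandardCharsets.US_ASCII);
        System.out.println(declaredLabel(asciiCompatible));

        // Genuine UTF-16 bytes: NUL bytes sit between the letters, so nothing matches.
        byte[] realUtf16 = html.getBytes(StandardCharsets.UTF_16LE);
        System.out.println(declaredLabel(realUtf16));

        // The declared label is overridden rather than trusted.
        System.out.println(resolve(declaredLabel(asciiCompatible)));
    }
}
```

Under these assumptions, the first call finds the label `utf-16`, the second finds nothing because the interleaved NUL bytes break the match, and `resolve` returns UTF-8 instead of trusting the declaration.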
> HtmlEncodingDetector doesn't follow the specification
> -----------------------------------------------------
>
>                 Key: TIKA-2673
>                 URL: https://issues.apache.org/jira/browse/TIKA-2673
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>         Attachments: HtmlEncodingDetectorTest.java, StrictHtmlEncodingDetector.tar.gz
>
>
> This bug is linked to TIKA-2671, but does not concern metadata, but rather
> the bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where
> HtmlEncodingDetector differs from the specification, and thus fails at
> detecting the right charset.
> I am attaching the test cases to this issue: