[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520117#comment-16520117
]
Gerard Bouchar edited comment on TIKA-2673 at 6/22/18 12:04 PM:
----------------------------------------------------------------
[[email protected]] : Thank you very much for integrating my tests and for
your changes to HtmlEncodingDetector! I think at least the utf16 test shouldn't
be ignored. HtmlEncodingDetector does its detection by running regular
expressions on the byte stream decoded as ASCII. If the file were actually
encoded in UTF-16 (which uses at least two bytes per character and is not
compatible with ASCII), the regular expression would not have matched in the
first place. So when a UTF-16 declaration is found this way, the file cannot
really be UTF-16, and decoding it as UTF-16 will almost certainly result in
garbled text. [The
specification|https://html.spec.whatwg.org/multipage/parsing.html#the-input-byte-stream]
was written by people with experience of real-world misuse of character
encodings on the web, so I think we can confidently trust it on such edge
cases.
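To illustrate the argument, here is a minimal sketch. The class name, the
sniff method and the regular expression are hypothetical simplifications, not
Tika's actual code; the sketch only assumes that detection works by matching a
pattern against the bytes decoded as ASCII.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSketch {
    // Hypothetical, simplified pattern; NOT the expression used by HtmlEncodingDetector.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?i)<meta[^>]*charset\\s*=\\s*[\"']?([a-z0-9_-]+)");

    /** Decodes the bytes as ASCII and returns the declared charset, or null if none is found. */
    static String sniff(byte[] bytes) {
        String ascii = new String(bytes, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(ascii);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta charset=\"utf-16\"></head></html>";
        byte[] asciiBytes = html.getBytes(StandardCharsets.US_ASCII);
        byte[] utf16Bytes = html.getBytes(StandardCharsets.UTF_16LE);

        // ASCII-compatible file: the declaration is found.
        System.out.println(sniff(asciiBytes)); // prints "utf-16"

        // Real UTF-16 file: every character is followed by a 0x00 byte,
        // so the ASCII-decoded text never contains "<meta".
        System.out.println(sniff(utf16Bytes)); // prints "null"

        // Obeying the declaration on the ASCII-compatible file garbles it:
        // pairs of ASCII bytes are fused into unrelated UTF-16 code units.
        String garbled = new String(asciiBytes, StandardCharsets.UTF_16);
        System.out.println(garbled.contains("<meta")); // prints "false"
    }
}
```

In other words, under these assumptions a UTF-16 declaration can only ever be
matched in a file that is not UTF-16, which is why the specification tells
parsers not to take such a declaration at face value.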
was (Author: gbouchar):
[[email protected]] I think at least the utf16 test shouldn't be ignored.
HtmlEncodingDetector does its detection by running regular expressions on the
byte stream decoded as ASCII. If the file were actually encoded in UTF-16
(which uses at least two bytes per character and is not compatible with
ASCII), the regular expression would not have matched in the first place.
Decoding it as UTF-16 will almost certainly result in garbled text. [The
specification|https://html.spec.whatwg.org/multipage/parsing.html#the-input-byte-stream]
was written by people with experience of real-world misuse of character
encodings on the web, so I think we can confidently trust it on such edge
cases.
> HtmlEncodingDetector doesn't follow the specification
> -----------------------------------------------------
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
> Issue Type: Bug
> Reporter: Gerard Bouchar
> Priority: Major
> Attachments: HtmlEncodingDetectorTest.java,
> StrictHtmlEncodingDetector.tar.gz
>
>
> This bug is linked to TIKA-2671, but it concerns the byte-based detection
> itself rather than metadata.
> While reading the specification, I collected a list of sample cases where
> HtmlEncodingDetector deviates from the specification and therefore fails to
> detect the right charset.
> I am attaching the test cases to this issue: