[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520127#comment-16520127
 ] 

Gerard Bouchar edited comment on TIKA-2673 at 6/22/18 12:06 PM:
----------------------------------------------------------------

Another part of the specification I think we should respect is [character 
encoding names and labels|https://encoding.spec.whatwg.org/#names-and-labels]. 
Several labels in that table map to different charsets than the Java aliases 
of the same name do, and I think using the labels in this table makes more 
sense than using the ones defined by Java (which were not meant to be used in 
HTML, or to be compatible with HTML).
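
To illustrate the difference, here is a minimal sketch, not the actual Tika 
code. The class name WhatwgLabelSketch and its resolve method are hypothetical, 
and the map holds only a small excerpt of the WHATWG table. It follows the 
spec's lookup rule (strip leading and trailing ASCII whitespace, then match 
ASCII case-insensitively) and returns the encoding name as a string, so it does 
not depend on which charsets a given JRE ships.

```java
import java.nio.charset.Charset;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper, not part of Tika: resolves a few WHATWG encoding labels. */
final class WhatwgLabelSketch {

    // Small excerpt of the WHATWG "names and labels" table (label -> encoding name).
    private static final Map<String, String> LABELS = Map.of(
            "utf-8", "UTF-8",
            "utf8", "UTF-8",
            "iso-8859-1", "windows-1252",
            "latin1", "windows-1252",
            "ascii", "windows-1252",
            "us-ascii", "windows-1252",
            "windows-1252", "windows-1252",
            "iso-8859-9", "windows-1254",
            "windows-1254", "windows-1254");

    /** Returns the WHATWG encoding name for a label, or empty if the label is not in the excerpt. */
    static Optional<String> resolve(String label) {
        // Strip leading and trailing ASCII whitespace only.
        int start = 0;
        int end = label.length();
        while (start < end && isAsciiWhitespace(label.charAt(start))) {
            start++;
        }
        while (end > start && isAsciiWhitespace(label.charAt(end - 1))) {
            end--;
        }
        // Lowercase A-Z only, so that the match is ASCII case-insensitive.
        StringBuilder key = new StringBuilder(end - start);
        for (int i = start; i < end; i++) {
            char c = label.charAt(i);
            key.append(c >= 'A' && c <= 'Z' ? (char) (c + 32) : c);
        }
        return Optional.ofNullable(LABELS.get(key.toString()));
    }

    private static boolean isAsciiWhitespace(char c) {
        return c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    public static void main(String[] args) {
        // What Java reports for the name declared in the page.
        System.out.println(Charset.forName("ISO-8859-1").name());
        // What the WHATWG table says an HTML parser should use for the same kind of label.
        System.out.println(resolve(" Latin1\n").orElse("(unknown label)"));
    }
}
```

With this excerpt, a page declaring "iso-8859-1" or "latin1" resolves to 
windows-1252, whereas Charset.forName("ISO-8859-1") keeps ISO-8859-1. That is 
the kind of mismatch I mean.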


> HtmlEncodingDetector doesn't follow the specification
> -----------------------------------------------------
>
>                 Key: TIKA-2673
>                 URL: https://issues.apache.org/jira/browse/TIKA-2673
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>         Attachments: HtmlEncodingDetectorTest.java, 
> StrictHtmlEncodingDetector.tar.gz
>
>
> This bug is linked to TIKA-2671, but it concerns the bytes-based detection 
> itself rather than metadata.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails to 
> detect the right charset.
> I am attaching the test cases to this issue: 


