[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

Tim Allison (JIRA) Thu, 21 Jun 2018 11:53:32 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519679#comment-16519679
 ]


Tim Allison commented on TIKA-2673:
-----------------------------------

[~gbouchar], thank you for these unit tests!  I've added them and made the easy 
fixes where I could.  As you know, a full parse is non-trivial, and I'd 
like evidence from some corpus that the effort is worth it.

 

If you'd like to contribute a StrictHTMLEncodingDetector, we could compare its 
performance against the current HtmlEncodingDetector on our 1TB regression corpus.

 

If you'd like access to our VM, either to run your own comparisons or to help us 
curate the corpus and make it more representative of modern websites with diverse 
languages and encodings, let me know.

> HtmlEncodingDetector doesn't follow the specification
> -----------------------------------------------------
>
>                 Key: TIKA-2673
>                 URL: https://issues.apache.org/jira/browse/TIKA-2673
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>         Attachments: HtmlEncodingDetectorTest.java
>
>
> This bug is linked to TIKA-2671, but it concerns the bytes-based detection 
> itself rather than the metadata.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails to 
> detect the right charset.
> I am attaching the test cases to this issue: 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
