[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519679#comment-16519679
]
Tim Allison commented on TIKA-2673:
-----------------------------------
[~gbouchar], thank you for these unit tests! I've added them and made the easy
fixes where I could. As you know, a full parse is non-trivial, and I'd like
evidence from a corpus that the effort is worth it.
If you'd like to contribute a StrictHTMLEncodingDetector, we could compare the
performance of that with what we have on our 1TB regression corpus.
If you'd like access to our VM, either to run your own comparisons or to help us
curate the corpus and make it more representative of modern websites with
diverse languages and encodings, let me know.
> HtmlEncodingDetector doesn't follow the specification
> -----------------------------------------------------
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
> Issue Type: Bug
> Reporter: Gerard Bouchar
> Priority: Major
> Attachments: HtmlEncodingDetectorTest.java
>
>
> This bug is linked to TIKA-2671, but it concerns the bytes-based detection
> itself rather than metadata.
> While reading the specification, I collected a list of sample cases where
> HtmlEncodingDetector differs from the specification, and thus fails at
> detecting the right charset.
> I am attaching the test cases to this issue: