[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570587#comment-16570587
]
Tim Allison commented on TIKA-2673:
-----------------------------------
[~gbouchar], on the evaluation, it looks like 3 of the files have the same
URLs (105,956), but {{segment_big_chrome_charsets.jsonl.xz}} has ~200k.
Should I ignore that one? Second point on the evaluation: I really like how
you classified results as "correct", "similar" and "wrong". This kind of
classification continues to be a pain, but it is necessary.
bq. I think most people want an encoding detector that "just works" by default.
Yes, I agree. My thinking is that if we migrate to the newer detector, we'd
specify it correctly in the SPI file, as we do now with the
html->universal->icu4j chain. That would then "just work" by default. Until
that point, though, users would have to specify the newer detector themselves,
and we can show them that they ought to include icu4j after the newer
detector... Let me think about this some more.
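To illustrate the ordering idea above, here is a minimal sketch of a fallback
chain in which the first detector that returns an answer wins. It is
self-contained and does not use Tika's classes: DetectorChain and its nested
Detector interface are hypothetical names invented for this example, and the
"first non-empty answer wins" rule is an assumption about how such a chain
would behave.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch, not Tika's API: an ordered chain of charset detectors.
final class DetectorChain {

    // A detector inspects the leading bytes of a document and either names a
    // charset or abstains by returning Optional.empty().
    interface Detector extends Function<byte[], Optional<Charset>> {}

    private final List<Detector> detectors;

    DetectorChain(List<Detector> detectors) {
        // Defensive copy so later changes to the caller's list do not alter the order.
        this.detectors = new ArrayList<>(detectors);
    }

    // Ask each detector in order; the first one that does not abstain decides.
    Optional<Charset> detect(byte[] prefix) {
        for (Detector detector : detectors) {
            Optional<Charset> result = detector.apply(prefix);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }
}
```

With a chain like this, placing a strict detector first and a statistical one
(such as icu4j) last means the statistical guess is used only when the strict
detector abstains.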
bq. I can make a pull request for a separate encoding detector using only the
BOM.
I don't feel strongly about this. Let's wait to see if there's a need. Thank
you!
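For reference, a BOM-only detector of the kind proposed above could be as
small as the following sketch. It is not the code from any attachment or pull
request: BomSniffer is a hypothetical name, and the sketch assumes the three
byte order marks recognized by the WHATWG Encoding Standard (UTF-8, UTF-16BE,
UTF-16LE).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical sketch of a BOM-only detector; not part of Tika.
final class BomSniffer {

    private BomSniffer() {}

    // Returns the charset announced by a leading byte order mark,
    // or Optional.empty() when the input does not start with one.
    static Optional<Charset> detect(byte[] b) {
        if (b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        // A UTF-32LE BOM (FF FE 00 00) also matches here; this sketch
        // deliberately looks at UTF-8 and UTF-16 marks only.
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }
}
```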
> HtmlEncodingDetector doesn't follow the specification
> -----------------------------------------------------
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
> Issue Type: Bug
> Reporter: Gerard Bouchar
> Assignee: Tim Allison
> Priority: Major
> Fix For: 1.19, 2.0.0
>
> Attachments: HtmlEncodingDetectorTest.java,
> StrictHtmlEncodingDetector.tar.gz, image-2018-07-13-11-28-16-657.png
>
>
> This bug is linked to TIKA-2671; it concerns not the metadata but the
> bytes-based detection itself.
> While reading the specification, I collected a list of sample cases where
> HtmlEncodingDetector differs from the specification, and thus fails to
> detect the right charset.
> I am attaching the test cases to this issue: