Tim Allison created TIKA-4185:
---------------------------------
Summary: Study encoding detection (again) on html
Key: TIKA-4185
URL: https://issues.apache.org/jira/browse/TIKA-4185
Project: Tika
Issue Type: Wish
Reporter: Tim Allison
We haven't looked at charset encoding detection in a while. It would be useful
to get a fresh pull of HTML files from CommonCrawl and compare different
encoding detection methods.
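A rough sketch of what such a comparison could look like, using only the JDK
rather than Tika's own detectors. The class name, the three toy detectors (BOM
check, <meta> sniffing, strict UTF-8 decoding) and the 8192-byte sniff window
are all assumptions for illustration; none of this is existing Tika code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical harness: runs several simple detectors over the same bytes
// so their answers can be compared side by side.
class CharsetDetectorComparison {

    // Assumed sniff window; real detectors choose their own limits.
    private static final int SNIFF_BYTES = 8192;

    // Finds charset=... inside a <meta> tag (both the HTML5 and http-equiv forms).
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_:.\\-]+)",
            Pattern.CASE_INSENSITIVE);

    // Detector 1: byte order mark at the start of the stream.
    static Optional<Charset> fromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }

    // Detector 2: charset declared in a <meta> tag near the top of the page.
    static Optional<Charset> fromMeta(byte[] b) {
        int n = Math.min(b.length, SNIFF_BYTES);
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives intact.
        String head = new String(b, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name in the page.
            return Optional.empty();
        }
    }

    // Detector 3: report UTF-8 only if the whole input decodes without errors.
    // Note that pure ASCII input also passes this check.
    static Optional<Charset> strictUtf8(byte[] b) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(b));
            return Optional.of(StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    // Runs every detector and returns the answers keyed by detector name.
    static Map<String, Optional<Charset>> compare(byte[] html) {
        Map<String, Optional<Charset>> results = new LinkedHashMap<>();
        results.put("bom", fromBom(html));
        results.put("meta", fromMeta(html));
        results.put("strict-utf8", strictUtf8(html));
        return results;
    }
}
```

Calling compare() on each fetched page and tallying where the detectors
disagree would give a first list of files worth inspecting by hand. The actual
study would presumably plug in real detector implementations instead of these
toy ones.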
On a separate ticket, we can also run a comparison between TagSoup and jsoup
(Tika 2.x vs. 3.x) to see where we are.