Tim Allison created TIKA-4185:
---------------------------------

             Summary: Study encoding detection (again) on html
                 Key: TIKA-4185
                 URL: https://issues.apache.org/jira/browse/TIKA-4185
             Project: Tika
          Issue Type: Wish
            Reporter: Tim Allison


We haven't looked at charset encoding detection in a while. It would be useful 
to pull a fresh set of HTML files from CommonCrawl and compare the available 
encoding detection methods on them.
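
As a rough illustration of what such a comparison could look like, here is a
minimal sketch that tallies agreement between two detection methods over a set
of pages. Everything in it is an assumption for illustration: the two
detectors (declared <meta> charset vs. trial decoding) are simple
standard-library stand-ins and not Tika's own detectors, and the function
names and candidate list are hypothetical.

```python
import re
from collections import Counter

# Matches both <meta charset="..."> and the http-equiv content="...; charset=..." form.
META_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def detect_by_meta(data: bytes):
    """Return the charset declared in a <meta> tag within the first 8 KiB, or None."""
    m = META_RE.search(data[:8192])
    return m.group(1).decode("ascii").lower() if m else None


def detect_by_trial_decode(data: bytes, candidates=("utf-8", "windows-1252")):
    """Return the first candidate charset that decodes the bytes without error, or None."""
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


def compare(samples):
    """Tally how often the two stand-in detectors agree over an iterable of byte strings."""
    tally = Counter()
    for data in samples:
        declared = detect_by_meta(data)
        decoded = detect_by_trial_decode(data)
        tally["agree" if declared == decoded else "disagree"] += 1
    return tally


if __name__ == "__main__":
    page = '<html><head><meta charset="utf-8"></head><body>café</body></html>'
    samples = [
        page.encode("utf-8"),         # declaration matches the actual bytes
        page.encode("windows-1252"),  # declaration is wrong for the actual bytes
    ]
    print(compare(samples))
```

The second sample shows the kind of case a study like this would want to
count: the page declares one charset but its bytes are in another. A real run
would swap in the actual detectors and a CommonCrawl sample.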

Separately, on its own ticket, we can also run a comparison between tagsoup 
and jsoup (Tika 2.x vs 3.x) to see where we are.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
