[ https://issues.apache.org/jira/browse/TIKA-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-2933:
------------------------------
Issue Type: Improvement (was: Bug)
> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> ------------------------------------------------------------------------
>
> Key: TIKA-2933
> URL: https://issues.apache.org/jira/browse/TIKA-2933
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
>
> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> I'm finally getting around to running the comparisons between our legacy
> HtmlEncodingDetector and the newer StandardHtmlEncodingDetector. More
> analysis is required, but the newer one is generally better*. One area for
> improvement/explanation, though, is the "replacement" encoding.
> * There are 1 million more "common words" in text extracted from files with
> the StandardHtmlEncodingDetector than with only our legacy detector. There
> are 133M common words in our legacy extracts, so that's less than a 1%
> improvement.
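
For context on the "replacement" encoding mentioned in the description: the WHATWG Encoding Standard maps a handful of legacy labels to a special "replacement" encoding, whose decoder yields U+FFFD rather than decoding the bytes. The sketch below only illustrates that label mapping. The class and method names are hypothetical, and this is not the code in StandardHtmlEncodingDetector; whether Tika should keep, change, or explain these mappings is the open question of this issue.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Tika.
public class ReplacementLabels {

    // Labels that the WHATWG Encoding Standard assigns to the "replacement" encoding.
    private static final Set<String> REPLACEMENT_LABELS = Set.of(
            "csiso2022kr",
            "hz-gb-2312",
            "iso-2022-cn",
            "iso-2022-cn-ext",
            "iso-2022-kr",
            "replacement");

    // Returns true if the declared charset label would resolve to "replacement".
    // Assumption: trim() plus lower-casing is a close enough stand-in for the
    // standard's "strip ASCII whitespace, then ASCII-lowercase" label matching.
    public static boolean isReplacementLabel(String label) {
        if (label == null) {
            return false;
        }
        return REPLACEMENT_LABELS.contains(label.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isReplacementLabel("ISO-2022-KR"));
        System.out.println(isReplacementLabel("utf-8"));
    }
}
```

A check like this could be used when comparing the two detectors' output, to count how many files declare one of these labels and therefore land on "replacement" instead of a decodable charset.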
--
This message was sent by Atlassian Jira
(v8.3.2#803003)