[ https://issues.apache.org/jira/browse/TIKA-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-2933:
------------------------------
Description:
Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
I'm finally getting around to running the comparisons between our legacy
HtmlEncodingDetector and the newer StandardHtmlEncodingDetector. More analysis
is required, but the newer one is generally better*. One area for
improvement/explanation, though, is the "replacement" encoding.
* There are 1 million more "common words" in text extracted from files with the
StandardHtmlEncodingDetector than with only our legacy detector. There are 133M
common words in our legacy extracts, so that's less than a 1% improvement
(roughly 0.75%).
was:
Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
I'm finally getting around to running the comparisons between our legacy
HtmlEncodingDetector and the newer StandardHtmlEncodingDetector. More analysis
is required, but the newer one is, generally, much better. One area for
improvement/explanation, though, is the "replacement" encoding.
> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> ------------------------------------------------------------------------
>
> Key: TIKA-2933
> URL: https://issues.apache.org/jira/browse/TIKA-2933
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Major
>
> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> I'm finally getting around to running the comparisons between our legacy
> HtmlEncodingDetector and the newer StandardHtmlEncodingDetector. More
> analysis is required, but the newer one is generally better*. One area for
> improvement/explanation, though, is the "replacement" encoding.
> * There are 1 million more "common words" in text extracted from files with
> the StandardHtmlEncodingDetector than with only our legacy detector. There
> are 133M common words in our legacy extracts, so that's less than a 1%
> improvement (roughly 0.75%).
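
For context on what a "replacement" mapping looks like: the WHATWG Encoding
Standard assigns a handful of legacy charset labels to an encoding named
"replacement". The sketch below only illustrates that label lookup. The class
and method names are hypothetical, and it is an assumption that this is the
mapping under discussion; it is not taken from Tika's code.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Tika. */
public class ReplacementLabels {

    // Labels that the WHATWG Encoding Standard assigns to the "replacement" encoding.
    private static final Set<String> REPLACEMENT_LABELS = Set.of(
            "csiso2022kr", "hz-gb-2312", "iso-2022-cn",
            "iso-2022-cn-ext", "iso-2022-kr", "replacement");

    /** Returns true if the declared charset label maps to "replacement". */
    public static boolean isReplacementLabel(String label) {
        if (label == null) {
            return false;
        }
        // Approximates the standard's label matching: trim surrounding
        // whitespace, then compare case-insensitively.
        String normalized = label.strip().toLowerCase(Locale.ROOT);
        return REPLACEMENT_LABELS.contains(normalized);
    }

    public static void main(String[] args) {
        for (String label : new String[] {"ISO-2022-KR", " hz-gb-2312 ", "utf-8"}) {
            System.out.println(label.strip() + " -> " + isReplacementLabel(label));
        }
    }
}
```

A check like this could be used to count how many files in a comparison run
declare one of these labels before deciding how a detector should treat them.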
--
This message was sent by Atlassian Jira
(v8.3.2#803003)