[jira] [Updated] (TIKA-2933) Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.

Tim Allison (Jira) Fri, 30 Aug 2019 06:43:34 -0700


     [ 
https://issues.apache.org/jira/browse/TIKA-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tim Allison updated TIKA-2933:
------------------------------
    Issue Type: Improvement  (was: Bug)

> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> ------------------------------------------------------------------------
>
>                 Key: TIKA-2933
>                 URL: https://issues.apache.org/jira/browse/TIKA-2933
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>
> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> I'm finally getting around to running the comparisons between our legacy 
> HtmlEncodingDetector and the newer StandardHtmlEncodingDetector.  More 
> analysis is required, but the newer one is generally better*.  One area for 
> improvement/explanation, though, is the "replacement" encoding. 
> * There are 1 million more "common words" in text extracted from files with 
> the StandardHtmlEncodingDetector than with our legacy detector alone.  There 
> are 133M common words in our legacy extracts, so that's an improvement of 
> less than 1%.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
