[jira] [Updated] (TIKA-2933) Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.

Tim Allison (Jira) Fri, 30 Aug 2019 05:50:07 -0700


     [ 
https://issues.apache.org/jira/browse/TIKA-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tim Allison updated TIKA-2933:
------------------------------
    Description: 
Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.

I'm finally getting around to running the comparisons between our legacy 
HtmlEncodingDetector and the newer StandardHtmlEncodingDetector.  More analysis 
is required, but the newer one is generally better*.  One area for 
improvement/explanation, though, is the "replacement" encoding. 

* There are 1 million more "common words" in text extracted from files with the 
StandardHtmlEncodingDetector than with our legacy detector alone.  There are 
133M common words in our legacy extracts, so that's an improvement of less 
than 1% (about 0.75%).

  was:
Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.

I'm finally getting around to running the comparisons between our legacy 
HtmlEncodingDetector and the newer StandardHtmlEncodingDetector.  More analysis 
is required, but the newer one is, generally, much better.  One area for 
improvement/explanation, though, is the "replacement" encoding. 


> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> ------------------------------------------------------------------------
>
>                 Key: TIKA-2933
>                 URL: https://issues.apache.org/jira/browse/TIKA-2933
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Major
>
> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
>
> I'm finally getting around to running the comparisons between our legacy 
> HtmlEncodingDetector and the newer StandardHtmlEncodingDetector.  More 
> analysis is required, but the newer one is generally better*.  One area for 
> improvement/explanation, though, is the "replacement" encoding. 
>
> * There are 1 million more "common words" in text extracted from files with 
> the StandardHtmlEncodingDetector than with our legacy detector alone.  There 
> are 133M common words in our legacy extracts, so that's an improvement of 
> less than 1% (about 0.75%).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
