[ 
https://issues.apache.org/jira/browse/TIKA-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919644#comment-16919644
 ] 

Tim Allison commented on TIKA-2933:
-----------------------------------

One example of the ISO-2022-KR files: 
http://162.242.228.174/docs/commoncrawl3/4R/4RFVUBFP7VWAUUBAGLQNJPLXOAXNZSZH

> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> ------------------------------------------------------------------------
>
>                 Key: TIKA-2933
>                 URL: https://issues.apache.org/jira/browse/TIKA-2933
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>
> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> I'm finally getting around to running the comparisons between our legacy 
> HtmlEncodingDetector and the newer StandardHtmlEncodingDetector.  More 
> analysis is required, but the newer one is generally better*.  One area for 
> improvement/explanation, though, is in the "replacement" encoding. 
> * There are 1 million more "common words" in text extracted from files with 
> the StandardHtmlEncodingDetector than with only our legacy detector.  There 
> are 133M common words in our legacy extracts, so that's less than a 1% 
> improvement.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
