[ 
https://issues.apache.org/jira/browse/TIKA-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919519#comment-16919519
 ] 

Tim Allison commented on TIKA-2933:
-----------------------------------

This table shows the number of files whose extracts contain more "common 
words" with the legacy HTML encoding detector than with the standard one:
||Standard||Legacy||Number of files||
|windows-1252|US-ASCII|442|
|GBK|GB2312|163|
|windows-1252|US-ASCII|130|
|replacement|ISO-2022-KR|119|
|GBK|GB2312|55|
|UTF-8|ISO-8859-1|52|
|windows-1254|ISO-8859-9|51|
|Big5|Big5-HKSCS|32|
|replacement|ISO-2022-CN|14|
|replacement|ISO-2022-KR|11|
|windows-1254|ISO-8859-9|10|
|Big5|Big5-HKSCS|8|
|UTF-8|windows-1250|5|
|UTF-8|EUC-JP|4|
|x-windows-874|TIS-620|4|
|UTF-8|Shift_JIS|3|
|UTF-8|ISO-8859-4|2|
|UTF-8|Big5|2|
|UTF-8|EUC-JP|2|
|UTF-8|UTF-16|2|
|UTF-8|windows-1251|2|
|UTF-8|windows-1252|2|
|x-windows-874|TIS-620|1|
|ISO-8859-4|ISO-8859-1|1|
|UTF-8|TIS-620|1|
|UTF-8|Big5-HKSCS|1|
|UTF-8|UTF-16|1|
|Shift_JIS|windows-31j|1|
|UTF-8|GB2312|1|
|UTF-8|windows-1256|1|
|UTF-8|windows-1254|1|
|UTF-8|ISO-8859-9|1|
|UTF-8|ISO-8859-2|1|
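
Several Standard/Legacy pairs appear on more than one row above, and the message does not say what distinguishes those rows. As a rough aid for reading the table, the following is a minimal sketch that simply sums the file counts per "Standard" charset. The class and method names are hypothetical and are not part of Tika.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for reading the table above; not part of Tika.
final class StandardCharsetTally {

    // Sums the file counts per "Standard" charset from rows shaped like |Standard|Legacy|count|.
    static Map<String, Integer> tally(List<String> rows) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String row : rows) {
            if (row.startsWith("||")) {
                continue; // Jira header row, e.g. ||Standard||Legacy||Number of files||
            }
            // Splitting "|a|b|n|" on '|' yields ["", "a", "b", "n"].
            String[] cells = row.split("\\|");
            if (cells.length < 4) {
                continue; // not a three-column data row
            }
            totals.merge(cells[1].trim(), Integer.parseInt(cells[3].trim()), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Only the three "replacement" rows from the table are used here.
        List<String> rows = List.of(
            "|replacement|ISO-2022-KR|119|",
            "|replacement|ISO-2022-CN|14|",
            "|replacement|ISO-2022-KR|11|");
        System.out.println(tally(rows)); // prints {replacement=144}
    }
}
```

Summed this way, the three "replacement" rows account for 144 files (119 + 14 + 11), all with a legacy detection of ISO-2022-KR or ISO-2022-CN.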

> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> ------------------------------------------------------------------------
>
>                 Key: TIKA-2933
>                 URL: https://issues.apache.org/jira/browse/TIKA-2933
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Major
>
> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> I'm finally getting around to running the comparisons between our legacy 
> HtmlEncodingDetector and the newer StandardHtmlEncodingDetector.  More 
> analysis is required, but the newer one is generally better*.  One area for 
> improvement/explanation, though, is the "replacement" encoding. 
> * There are 1 million more "common words" in text extracted from files with 
> the StandardHtmlEncodingDetector than with only our legacy detector.  There 
> are 133M common words in our legacy extracts, so that's less than 1% 
> improvement.
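
For context on the "replacement" rows: the WHATWG Encoding Standard assigns a small set of legacy labels to a "replacement" encoding instead of decoding them. Assuming the standard detector follows that label table (an assumption here, not confirmed in this message), the mapping could be sketched as below. The class and method names are hypothetical and are not Tika's API.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of the WHATWG "replacement" labels; not Tika's implementation.
final class ReplacementLabels {

    // Labels the WHATWG Encoding Standard assigns to the "replacement" encoding.
    static final Set<String> REPLACEMENT_LABELS = Set.of(
        "csiso2022kr", "hz-gb-2312", "iso-2022-cn",
        "iso-2022-cn-ext", "iso-2022-kr", "replacement");

    // True if a declared label would resolve to "replacement" under the WHATWG label table.
    static boolean isReplacement(String label) {
        // The standard trims ASCII whitespace and matches case-insensitively;
        // strip() is a slightly broader approximation of that trimming.
        return REPLACEMENT_LABELS.contains(label.strip().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        for (String label : new String[] {"ISO-2022-KR", "ISO-2022-CN", "GB2312"}) {
            // Whether the JVM can decode the charset depends on the installed charset providers.
            boolean jvmSupported = Charset.isSupported(label);
            System.out.println(label + ": replacement=" + isReplacement(label)
                + ", supported by this JVM=" + jvmSupported);
        }
    }
}
```

This would be consistent with the table, where every "replacement" row has a legacy detection of ISO-2022-KR or ISO-2022-CN.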



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
