[jira] [Commented] (TIKA-2933) Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.

Tim Allison (Jira) Fri, 30 Aug 2019 06:21:24 -0700


    [ https://issues.apache.org/jira/browse/TIKA-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919530#comment-16919530 ]


Tim Allison commented on TIKA-2933:
-----------------------------------

Going the other way, this table counts the files for which the _Standard_ 
encoding detector yielded more common tokens than the legacy detector, broken 
down by the encoding each detector chose:
||Standard||Legacy||Number of Files with More Common Tokens in Standard||
|UTF-8|UTF-16|718|
|windows-1252|US-ASCII|528|
|GBK|GB2312|503|
|UTF-8|windows-1251|361|
|GBK|GB2312|275|
|UTF-8|UTF-16|181|
|UTF-8|EUC-KR|166|
|UTF-8|Big5|138|
|windows-1254|ISO-8859-9|70|
|UTF-8|Shift_JIS|67|
|windows-1254|ISO-8859-9|61|
|UTF-8|ISO-8859-2|58|
|UTF-8|ISO-8859-1|55|
|UTF-8|ISO-8859-9|55|
|UTF-8|windows-1251|52|
|UTF-8|ISO-8859-5|49|
|UTF-8|UTF-16BE|44|
|UTF-8|IBM852|43|
|windows-1252|US-ASCII|39|
|UTF-8|ISO-8859-1|29|
|UTF-8|US-ASCII|28|
|UTF-8|EUC-JP|25|
|UTF-8|windows-1252|24|
|UTF-8|windows-1254|22|
|UTF-8|ISO-2022-JP|22|
|UTF-8|windows-1252|20|
|UTF-8|ISO-8859-4|20|
|UTF-8|UTF-16LE|20|
|UTF-8|ISO-8859-9|19|
|UTF-8|IBM866|19|
|UTF-8|ISO-8859-15|18|
|UTF-8|GB2312|18|
|UTF-8|ISO-8859-3|16|
|UTF-8|ISO-8859-2|14|
|UTF-8|windows-1250|14|
|UTF-8|windows-1254|9|
|UTF-8|ISO-2022-KR|8|
|UTF-8|windows-1250|8|
|UTF-8|KOI8-R|8|
|UTF-8|ISO-8859-4|7|
|UTF-8|ISO-8859-7|7|
|UTF-8|windows-1256|7|
|UTF-8|windows-1256|7|
|UTF-8|EUC-JP|7|
|UTF-8|GB18030|6|
|UTF-8|GB2312|5|
|UTF-8|Big5|5|
|UTF-8|ISO-8859-5|4|
|UTF-8|ISO-2022-JP|4|
|UTF-8|US-ASCII|4|
|UTF-8|windows-1255|3|
|UTF-8|ISO-8859-13|3|
|UTF-8|GBK|3|
|UTF-8|ISO-8859-8|3|
|windows-1251|KOI8-R|3|
|Big5|Big5-HKSCS|3|
|Shift_JIS|windows-31j|3|
|UTF-8|UTF-16BE|2|
|UTF-8|x-iso-8859-11|2|
|windows-1252|ISO-8859-5|2|
|UTF-8|GB18030|2|
|UTF-8|x-IBM856|2|
|Shift_JIS|UTF-8|2|
|x-windows-874|x-iso-8859-11|2|
|UTF-8|ISO-8859-6|2|
|UTF-8|windows-1257|2|
|x-windows-874|TIS-620|1|
|windows-1251|ISO-8859-1|1|
|UTF-8|windows-1257|1|
|UTF-8|ISO-8859-6|1|
|UTF-8|windows-1258|1|
|UTF-8|x-windows-874|1|
|UTF-8|x-windows-950|1|
|windows-1252|IBM850|1|
|UTF-8|GBK|1|
|UTF-8|ISO-8859-3|1|

This table gives the total increase in common tokens under _Standard_, summed 
over only those files where Standard yielded more common tokens than Legacy.
||Standard||Legacy||More Common Tokens in Standard||
|UTF-8|UTF-16|686487|
|UTF-8|UTF-16|55453|
|GBK|GB2312|54223|
|UTF-8|windows-1251|45808|
|UTF-8|UTF-16BE|42832|
|UTF-8|UTF-16LE|27341|
|UTF-8|GBK|21308|
|UTF-8|Big5|21042|
|GBK|GB2312|20580|
|UTF-8|EUC-KR|19064|
|UTF-8|ISO-2022-JP|17747|
|windows-1252|US-ASCII|6560|
|UTF-8|ISO-8859-9|5160|
|UTF-8|ISO-8859-2|5075|
|UTF-8|GB2312|4546|
|UTF-8|ISO-8859-9|4477|
|UTF-8|windows-1251|3522|
|UTF-8|ISO-8859-1|2211|
|UTF-8|ISO-8859-5|2099|
|UTF-8|ISO-8859-1|1985|
|UTF-8|windows-1256|1964|
|UTF-8|Shift_JIS|1930|
|UTF-8|ISO-8859-4|1766|
|UTF-8|GB2312|1593|
|UTF-8|windows-1250|1570|
|windows-1251|KOI8-R|1507|
|UTF-8|windows-1256|1422|
|UTF-8|windows-1254|1271|
|windows-1254|ISO-8859-9|1259|
|UTF-8|windows-1252|1064|
|UTF-8|ISO-8859-3|1001|
|UTF-8|windows-1252|909|
|UTF-8|US-ASCII|733|
|windows-1252|US-ASCII|731|
|UTF-8|UTF-16BE|719|
|windows-1254|ISO-8859-9|627|
|UTF-8|EUC-JP|596|
|x-windows-874|TIS-620|572|
|UTF-8|EUC-JP|541|
|UTF-8|windows-1255|442|
|UTF-8|ISO-8859-7|331|
|UTF-8|ISO-8859-2|322|
|UTF-8|KOI8-R|318|
|UTF-8|Big5|270|
|UTF-8|ISO-2022-JP|250|
|Shift_JIS|UTF-8|237|
|UTF-8|windows-1254|217|
|UTF-8|ISO-8859-13|156|
|UTF-8|ISO-8859-15|146|
|UTF-8|ISO-8859-6|140|
|UTF-8|windows-1250|135|
|UTF-8|IBM852|130|
|UTF-8|ISO-8859-4|127|
|UTF-8|GB18030|101|
|UTF-8|IBM866|75|
|windows-1251|ISO-8859-1|73|
|windows-1252|ISO-8859-5|38|
|UTF-8|x-IBM856|38|
|UTF-8|windows-1258|36|
|UTF-8|ISO-8859-8|30|
|UTF-8|ISO-8859-5|29|
|UTF-8|US-ASCII|26|
|windows-1252|IBM850|24|
|UTF-8|ISO-2022-KR|22|
|UTF-8|ISO-8859-6|21|
|UTF-8|GB18030|15|
|UTF-8|GBK|13|
|UTF-8|x-iso-8859-11|12|
|UTF-8|x-windows-874|11|
|UTF-8|windows-1257|6|
|UTF-8|windows-1257|5|
|x-windows-874|x-iso-8859-11|5|
|UTF-8|ISO-8859-3|4|
|Big5|Big5-HKSCS|3|
|Shift_JIS|windows-31j|3|
|UTF-8|x-windows-950|2|
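
For readers who want to reproduce this kind of breakdown, here is a minimal sketch of how both tables could be derived from per-file results. It is not the code used for this comparison: the `FileComparison` record, its field names, and the sample numbers are all made up for illustration. The only assumption taken from the comment is that a file counts toward a row when Standard produced more common tokens than Legacy.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class FileComparison(NamedTuple):
    """One file's result under both detectors (hypothetical record layout)."""
    standard_charset: str
    legacy_charset: str
    standard_common_tokens: int
    legacy_common_tokens: int


Pair = Tuple[str, str]


def summarize(rows: Iterable[FileComparison]) -> Tuple[Dict[Pair, int], Dict[Pair, int]]:
    """Return (files improved, tokens gained) per (standard, legacy) charset pair.

    Only files where Standard has strictly more common tokens are counted.
    """
    files_improved: Dict[Pair, int] = defaultdict(int)
    tokens_gained: Dict[Pair, int] = defaultdict(int)
    for row in rows:
        gain = row.standard_common_tokens - row.legacy_common_tokens
        if gain > 0:
            key = (row.standard_charset, row.legacy_charset)
            files_improved[key] += 1
            tokens_gained[key] += gain
    return dict(files_improved), dict(tokens_gained)


# Made-up sample data, not taken from the real corpus.
sample = [
    FileComparison("UTF-8", "UTF-16", 120, 3),
    FileComparison("UTF-8", "UTF-16", 50, 10),
    FileComparison("GBK", "GB2312", 40, 35),
    FileComparison("UTF-8", "windows-1251", 20, 25),  # Legacy did better: skipped
]

files_improved, tokens_gained = summarize(sample)

# Emit rows in the same Jira table markup used above, largest file count first.
for pair in sorted(files_improved, key=files_improved.get, reverse=True):
    print("|{}|{}|{}|{}|".format(pair[0], pair[1], files_improved[pair], tokens_gained[pair]))
```

Files where Legacy did as well or better are skipped here; they would belong to the breakdown going in the opposite direction.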


> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> ------------------------------------------------------------------------
>
>                 Key: TIKA-2933
>                 URL: https://issues.apache.org/jira/browse/TIKA-2933
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Major
>
> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> I'm finally getting around to running the comparisons between our legacy 
> HtmlEncodingDetector and the newer StandardHtmlEncodingDetector.  More 
> analysis is required, but the newer one is generally better*.  One area for 
> improvement/explanation, though, is the "replacement" encoding. 
> * There are 1 million more "common words" in text extracted from files with 
> the StandardHtmlEncodingDetector than with only our legacy detector.  There 
> are 133M common words in our legacy extracts, so that's less than a 1% 
> improvement.
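
As a quick check of the "less than 1%" figure in the description above, the arithmetic can be sketched as follows. The two inputs are the rounded numbers quoted there (1 million and 133M), so the result is only approximate; the variable names are illustrative.

```python
# Rounded figures quoted in the issue description, not exact counts.
legacy_common_words = 133_000_000
additional_common_words = 1_000_000

# Relative gain of Standard over Legacy, as a fraction of the legacy total.
relative_gain = additional_common_words / legacy_common_words
print(f"{relative_gain:.2%}")  # prints 0.75%
```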



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
