[jira] [Comment Edited] (TIKA-2933) Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.

Tim Allison (Jira) Fri, 30 Aug 2019 06:11:06 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919519#comment-16919519
 ]


Tim Allison edited comment on TIKA-2933 at 8/30/19 1:09 PM:
------------------------------------------------------------

This table shows the number of files whose extracts contain more "common words" 
with the legacy HTML encoding detector than with the standard one.  Some pairs 
appear twice because application/xhtml and text/html are listed separately, but 
this should be enough to give an indication. 

||Standard||Legacy||Number of files||
|windows-1252|US-ASCII|442|
|GBK|GB2312|163|
|windows-1252|US-ASCII|130|
|replacement|ISO-2022-KR|119|
|GBK|GB2312|55|
|UTF-8|ISO-8859-1|52|
|windows-1254|ISO-8859-9|51|
|Big5|Big5-HKSCS|32|
|replacement|ISO-2022-CN|14|
|replacement|ISO-2022-KR|11|
|windows-1254|ISO-8859-9|10|
|Big5|Big5-HKSCS|8|
|UTF-8|windows-1250|5|
|UTF-8|EUC-JP|4|
|x-windows-874|TIS-620|4|
|UTF-8|Shift_JIS|3|
|UTF-8|ISO-8859-4|2|
|UTF-8|Big5|2|
|UTF-8|EUC-JP|2|
|UTF-8|UTF-16|2|
|UTF-8|windows-1251|2|
|UTF-8|windows-1252|2|
|x-windows-874|TIS-620|1|
|ISO-8859-4|ISO-8859-1|1|
|UTF-8|TIS-620|1|
|UTF-8|Big5-HKSCS|1|
|UTF-8|UTF-16|1|
|Shift_JIS|windows-31j|1|
|UTF-8|GB2312|1|
|UTF-8|windows-1256|1|
|UTF-8|windows-1254|1|
|UTF-8|ISO-8859-9|1|
|UTF-8|ISO-8859-2|1|
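
Because each Standard/Legacy pair can appear once per MIME type, it may help to sum the double entries before comparing pairs. The following is a minimal sketch of that bookkeeping, not part of the original analysis; the class and method names are hypothetical, and only a subset of the rows above is included.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: sums the file counts of rows that share a Standard/Legacy pair.
class MergeDoubleEntries {

    static Map<String, Integer> merge(String[][] rows) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String[] row : rows) {
            String pair = row[0] + " -> " + row[1];
            // Add this row's count to whatever has been accumulated for the pair.
            totals.merge(pair, Integer.parseInt(row[2]), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Subset of the table above: Standard, Legacy, number of files.
        String[][] rows = {
            {"windows-1252", "US-ASCII", "442"},
            {"GBK", "GB2312", "163"},
            {"windows-1252", "US-ASCII", "130"},
            {"replacement", "ISO-2022-KR", "119"},
            {"GBK", "GB2312", "55"},
            {"replacement", "ISO-2022-CN", "14"},
            {"replacement", "ISO-2022-KR", "11"},
        };
        merge(rows).forEach((pair, files) -> System.out.println(pair + ": " + files));
    }
}
```

With these rows, the windows-1252/US-ASCII pair sums to 572 files, GBK/GB2312 to 218, and replacement/ISO-2022-KR to 130.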

And this table shows how many more common tokens the Legacy extracts contain 
(counting only the files where there is an increase):

||Standard||Legacy||More Common Tokens in Legacy||
|replacement|ISO-2022-KR|32514|
|GBK|GB2312|5018|
|windows-1252|US-ASCII|2340|
|replacement|ISO-2022-KR|1984|
|GBK|GB2312|1688|
|windows-1252|US-ASCII|1619|
|Big5|Big5-HKSCS|1543|
|replacement|ISO-2022-CN|1429|
|UTF-8|windows-1256|1057|
|UTF-8|windows-1250|768|
|UTF-8|ISO-8859-1|254|
|UTF-8|ISO-8859-2|234|
|UTF-8|windows-1251|226|
|UTF-8|EUC-JP|176|
|windows-1254|ISO-8859-9|98|
|x-windows-874|TIS-620|78|
|UTF-8|EUC-JP|63|
|UTF-8|Big5|52|
|Big5|Big5-HKSCS|32|
|UTF-8|Shift_JIS|28|
|UTF-8|windows-1254|27|
|windows-1254|ISO-8859-9|16|
|UTF-8|GB2312|15|
|UTF-8|windows-1252|14|
|UTF-8|UTF-16|14|
|UTF-8|ISO-8859-9|8|
|UTF-8|ISO-8859-4|4|
|ISO-8859-4|ISO-8859-1|2|
|Shift_JIS|windows-31j|2|
|UTF-8|TIS-620|1|
|x-windows-874|TIS-620|1|
|UTF-8|UTF-16|1|
|UTF-8|Big5-HKSCS|1|
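
To put the token counts in proportion, one option is to divide each pair's extra common tokens by the number of affected files. The sketch below does that for the four largest pairs. It assumes (this is not stated above) that both tables cover the same set of files, and it sums the double entries first; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: average number of extra common tokens per affected file.
class ExtraTokensPerFile {

    static double perFile(int extraTokens, int files) {
        return (double) extraTokens / files;
    }

    public static void main(String[] args) {
        // pair -> {files, extra tokens}, with the double entries from both tables summed.
        // Assumption: the file counts and token counts describe the same files.
        Map<String, int[]> totals = new LinkedHashMap<>();
        totals.put("replacement -> ISO-2022-KR", new int[] {119 + 11, 32514 + 1984});
        totals.put("GBK -> GB2312", new int[] {163 + 55, 5018 + 1688});
        totals.put("windows-1252 -> US-ASCII", new int[] {442 + 130, 2340 + 1619});
        totals.put("replacement -> ISO-2022-CN", new int[] {14, 1429});

        totals.forEach((pair, t) -> System.out.printf(
            Locale.ROOT, "%s: %.1f extra common tokens per file%n", pair, perFile(t[1], t[0])));
    }
}
```

Under that assumption, the "replacement" rows stand out far more per file than the windows-1252/US-ASCII rows, even though the latter affect more files.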



was (Author: [email protected]):
This table includes the number of files with more "common words" in the 
extracts from the Legacy html encoding detector:
||Standard||Legacy||Number of files||
|windows-1252|US-ASCII|442|
|GBK|GB2312|163|
|windows-1252|US-ASCII|130|
|replacement|ISO-2022-KR|119|
|GBK|GB2312|55|
|UTF-8|ISO-8859-1|52|
|windows-1254|ISO-8859-9|51|
|Big5|Big5-HKSCS|32|
|replacement|ISO-2022-CN|14|
|replacement|ISO-2022-KR|11|
|windows-1254|ISO-8859-9|10|
|Big5|Big5-HKSCS|8|
|UTF-8|windows-1250|5|
|UTF-8|EUC-JP|4|
|x-windows-874|TIS-620|4|
|UTF-8|Shift_JIS|3|
|UTF-8|ISO-8859-4|2|
|UTF-8|Big5|2|
|UTF-8|EUC-JP|2|
|UTF-8|UTF-16|2|
|UTF-8|windows-1251|2|
|UTF-8|windows-1252|2|
|x-windows-874|TIS-620|1|
|ISO-8859-4|ISO-8859-1|1|
|UTF-8|TIS-620|1|
|UTF-8|Big5-HKSCS|1|
|UTF-8|UTF-16|1|
|Shift_JIS|windows-31j|1|
|UTF-8|GB2312|1|
|UTF-8|windows-1256|1|
|UTF-8|windows-1254|1|
|UTF-8|ISO-8859-9|1|
|UTF-8|ISO-8859-2|1|

> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> ------------------------------------------------------------------------
>
>                 Key: TIKA-2933
>                 URL: https://issues.apache.org/jira/browse/TIKA-2933
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Major
>
> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> I'm finally getting around to running the comparisons between our legacy 
> HtmlEncodingDetector and the newer StandardHtmlEncodingDetector.  More 
> analysis is required, but the newer one is generally better*.  One area for 
> improvement/explanation, though, is the "replacement" encoding. 
> * There are 1 million more "common words" in text extracted from files with 
> the StandardHtmlEncodingDetector than with only our legacy detector.  There 
> are 133M common words in our legacy extracts, so that's less than a 1% 
> improvement.
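
For context on the "replacement" rows: the WHATWG Encoding Standard maps a handful of legacy labels, including iso-2022-kr and iso-2022-cn, to a "replacement" encoding whose decoder yields only U+FFFD rather than the document's text. A detector that follows that mapping would therefore be expected to extract fewer common words from such files. The sketch below lists those labels; the class and method names are hypothetical and are not Tika's API.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of the WHATWG label-to-"replacement" mapping; not Tika code.
class ReplacementLabels {

    // Labels that the WHATWG Encoding Standard assigns to the "replacement" encoding.
    static final Set<String> REPLACEMENT_LABELS = Set.of(
        "csiso2022kr", "hz-gb-2312", "iso-2022-cn",
        "iso-2022-cn-ext", "iso-2022-kr", "replacement");

    // Labels are matched case-insensitively; trim() approximates the
    // standard's stripping of leading and trailing ASCII whitespace.
    static boolean mapsToReplacement(String label) {
        return REPLACEMENT_LABELS.contains(label.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        for (String label : new String[] {"ISO-2022-KR", "ISO-2022-CN", "GB2312"}) {
            System.out.println(label + " maps to replacement: " + mapsToReplacement(label));
        }
    }
}
```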



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
