[
https://issues.apache.org/jira/browse/TIKA-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919519#comment-16919519
]
Tim Allison commented on TIKA-2933:
-----------------------------------
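This table lists the number of files for which the extract from the legacy
HTML encoding detector contained more "common words" than the extract from
the standard detector, grouped by the encoding each detector reported: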
||Standard||Legacy||Number of files||
|windows-1252|US-ASCII|442|
|GBK|GB2312|163|
|windows-1252|US-ASCII|130|
|replacement|ISO-2022-KR|119|
|GBK|GB2312|55|
|UTF-8|ISO-8859-1|52|
|windows-1254|ISO-8859-9|51|
|Big5|Big5-HKSCS|32|
|replacement|ISO-2022-CN|14|
|replacement|ISO-2022-KR|11|
|windows-1254|ISO-8859-9|10|
|Big5|Big5-HKSCS|8|
|UTF-8|windows-1250|5|
|UTF-8|EUC-JP|4|
|x-windows-874|TIS-620|4|
|UTF-8|Shift_JIS|3|
|UTF-8|ISO-8859-4|2|
|UTF-8|Big5|2|
|UTF-8|EUC-JP|2|
|UTF-8|UTF-16|2|
|UTF-8|windows-1251|2|
|UTF-8|windows-1252|2|
|x-windows-874|TIS-620|1|
|ISO-8859-4|ISO-8859-1|1|
|UTF-8|TIS-620|1|
|UTF-8|Big5-HKSCS|1|
|UTF-8|UTF-16|1|
|Shift_JIS|windows-31j|1|
|UTF-8|GB2312|1|
|UTF-8|windows-1256|1|
|UTF-8|windows-1254|1|
|UTF-8|ISO-8859-9|1|
|UTF-8|ISO-8859-2|1|
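
To see how much of the table the "replacement" rows account for, the rows can be tallied by their Standard column. The following is only a minimal sketch: the class name `ReplacementTally` and the method `filesForStandard` are hypothetical helpers written for this comment, not part of Tika, and it assumes every input line is a data row in the `|Standard|Legacy|Number of files|` layout shown above.

```java
import java.util.Arrays;
import java.util.List;

public class ReplacementTally {

    // Sums the "Number of files" column over all rows whose Standard
    // column equals the given encoding name.
    // Assumes each line looks like "|replacement|ISO-2022-KR|119|".
    public static int filesForStandard(List<String> lines, String standard) {
        int total = 0;
        for (String line : lines) {
            // Splitting on '|' yields an empty first cell before the leading pipe,
            // so the three data cells are at indexes 1, 2 and 3.
            String[] cells = line.split("\\|");
            String standardCell = cells[1].trim();
            int files = Integer.parseInt(cells[3].trim());
            if (standardCell.equals(standard)) {
                total += files;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // The three "replacement" rows copied from the table above,
        // plus one unrelated row to show that it is ignored.
        List<String> rows = Arrays.asList(
                "|replacement|ISO-2022-KR|119|",
                "|replacement|ISO-2022-CN|14|",
                "|replacement|ISO-2022-KR|11|",
                "|GBK|GB2312|163|");
        System.out.println(filesForStandard(rows, "replacement")); // prints 144
    }
}
```

With the three "replacement" rows from the table, the tally is 119 + 14 + 11 = 144 files, all of them paired with an ISO-2022-KR or ISO-2022-CN result from the legacy detector.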
> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> ------------------------------------------------------------------------
>
> Key: TIKA-2933
> URL: https://issues.apache.org/jira/browse/TIKA-2933
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Major
>
> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> I'm finally getting around to running the comparisons between our legacy
> HtmlEncodingDetector and the newer StandardHtmlEncodingDetector. More
> analysis is required, but the newer one is generally better*. One area for
> improvement/explanation, though, is the "replacement" encoding.
> * There are 1 million more "common words" in text extracted from files with
> the StandardHtmlEncodingDetector than with only our legacy detector. There
> are 133M common words in our legacy extracts, so that's less than a 1%
> improvement (roughly 0.75%).