[ https://issues.apache.org/jira/browse/TIKA-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919502#comment-16919502 ]
Tim Allison commented on TIKA-2933:
-----------------------------------
[~gbouchar], in our test corpus, I see the following:
||Standard||Legacy||Count||
|text/html; charset=replacement|text/html; charset=ISO-2022-KR|122|
|text/html; charset=replacement|text/html; charset=ISO-2022-CN|14|
|application/xhtml+xml; charset=replacement|application/xhtml+xml; charset=ISO-2022-KR|11|
I've looked at a small handful of these files, and the ISO-2022-KR files really
are ISO-2022-KR.
Is there a reason we need to keep the mappings that you included here:
https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetAliases.java#L115
How do we know that we should ignore the tags for those charset types? Is ignoring them standard behavior?
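For reference, here is a minimal sketch of the kind of label mapping in question. It assumes the label list in the WHATWG Encoding Standard, which (as far as I recall) assigns a handful of legacy labels, including ISO-2022-KR and ISO-2022-CN, to the "replacement" encoding. The class and method names are hypothetical; this is not the code in CharsetAliases.java.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not Tika's CharsetAliases.
final class ReplacementLabels {

    // Labels assumed to map to the "replacement" encoding,
    // per the WHATWG Encoding Standard's label table.
    private static final Set<String> REPLACEMENT_LABELS = Set.of(
            "csiso2022kr",
            "hz-gb-2312",
            "iso-2022-cn",
            "iso-2022-cn-ext",
            "iso-2022-kr",
            "replacement");

    private ReplacementLabels() {
    }

    // Trims and lowercases the declared label, then returns "replacement"
    // if the label is in the set above; otherwise returns the normalized label.
    static String resolve(String label) {
        String normalized = label.trim().toLowerCase(Locale.ROOT);
        return REPLACEMENT_LABELS.contains(normalized) ? "replacement" : normalized;
    }
}
```

With a mapping like this, a page declaring `charset=ISO-2022-KR` resolves to `replacement`, which would explain the Standard column in the table above even when the bytes really are ISO-2022-KR.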
> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> ------------------------------------------------------------------------
>
> Key: TIKA-2933
> URL: https://issues.apache.org/jira/browse/TIKA-2933
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Major
>
> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> I'm finally getting around to running the comparisons between our legacy
> HTMLEncodingDetector and the newer StandardHTMLEncodingDetector. More
> analysis is required, but the newer one is, generally, much better. One area
> for improvement/explanation, though, is the "replacement" encoding.