[ 
https://issues.apache.org/jira/browse/TIKA-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919502#comment-16919502
 ] 

Tim Allison commented on TIKA-2933:
-----------------------------------

[~gbouchar], in our test corpus, I see the following differences between the standard and legacy detectors:

||Standard||Legacy||Count||
|text/html; charset=replacement|text/html; charset=ISO-2022-KR|122|
|text/html; charset=replacement|text/html; charset=ISO-2022-CN|14|
|application/xhtml+xml; charset=replacement|application/xhtml+xml; charset=ISO-2022-KR|11|

I've looked at a small handful of these files, and the ISO-2022-KR files really 
are ISO-2022-KR.
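
For anyone who wants to spot-check more of these files, here is a minimal sketch of one possible test. It assumes that looking for the ISO-2022-KR designator sequence (ESC $ ) C, as described in RFC 1557) is a good enough signal. The class and method names are hypothetical and nothing here is taken from Tika.

```java
public class Iso2022KrCheck {
    // ESC $ ) C: the designator that RFC 1557 places ahead of any shifted Korean text.
    private static final byte[] DESIGNATOR = {0x1B, 0x24, 0x29, 0x43};

    /** Returns true if the designator sequence occurs anywhere in the given bytes. */
    public static boolean hasDesignator(byte[] data) {
        outer:
        for (int i = 0; i + DESIGNATOR.length <= data.length; i++) {
            for (int j = 0; j < DESIGNATOR.length; j++) {
                if (data[i + j] != DESIGNATOR[j]) {
                    continue outer; // mismatch, try the next start offset
                }
            }
            return true;
        }
        return false;
    }
}
```

A hit is only a hint: it does not prove the rest of the stream is valid ISO-2022-KR.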

Is there a reason we need to keep the mappings that you included here?
https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetAliases.java#L115

How do we know that the charset tags should be ignored for those charset types? Is that standard behavior?
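
For context, the WHATWG Encoding Standard assigns a small set of labels, including iso-2022-kr and iso-2022-cn, to its "replacement" encoding, which would explain the "Standard" column above. Below is a rough sketch of that label lookup. It is not the code in CharsetAliases.java; the class name and the resolve method are made up for illustration.

```java
import java.util.Locale;
import java.util.Set;

public class ReplacementLabels {
    // Labels that the WHATWG Encoding Standard maps to the "replacement" encoding.
    private static final Set<String> REPLACEMENT_LABELS = Set.of(
            "csiso2022kr", "hz-gb-2312", "iso-2022-cn",
            "iso-2022-cn-ext", "iso-2022-kr", "replacement");

    /**
     * Returns "replacement" for the labels above; any other label is returned
     * trimmed and lowercased. trim() only approximates the standard's
     * ASCII-whitespace stripping.
     */
    public static String resolve(String label) {
        String normalized = label.trim().toLowerCase(Locale.ROOT);
        return REPLACEMENT_LABELS.contains(normalized) ? "replacement" : normalized;
    }
}
```

With a lookup like this, resolve("ISO-2022-KR") yields "replacement", while labels outside the set pass through unchanged apart from normalization.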

> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> ------------------------------------------------------------------------
>
>                 Key: TIKA-2933
>                 URL: https://issues.apache.org/jira/browse/TIKA-2933
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Major
>
> Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
> I'm finally getting around to running the comparisons between our legacy 
> HtmlEncodingDetector and the newer StandardHtmlEncodingDetector.  More 
> analysis is required, but the newer one is generally much better.  One area 
> for improvement/explanation, though, is the "replacement" encoding. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
