[
https://issues.apache.org/jira/browse/TIKA-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-3479:
------------------------------
Summary: UniversalCharsetDetector in 2.x is misidentifying windows-1250 as
ISO-8859-1 (was: UniversalCharsetDetector in 2.x is misidentifying
windows-1252(?) as ISO-8859-1)
> UniversalCharsetDetector in 2.x is misidentifying windows-1250 as ISO-8859-1
> ----------------------------------------------------------------------------
>
> Key: TIKA-3479
> URL: https://issues.apache.org/jira/browse/TIKA-3479
> Project: Tika
> Issue Type: Task
> Affects Versions: 2.0.0-BETA
> Reporter: Tim Allison
> Priority: Minor
>
> We've lost quite a few "common words" for Czech and Slovak text files in 2.x
> vs. 1.x. The key issue appears to be the following (which we do not have in
> 1.x).
> {noformat}
> /*
>  * hex value 0x81, 0x8d, 0x8f, 0x90 don't exist in charset windows-1252.
>  * If these value's count > 0, return true
>  * */
> private Boolean hasNonexistentHexInCharsetWindows1252() {
>     return (statistics.count(0x81) > 0 || statistics.count(0x8d) > 0 ||
>             statistics.count(0x8f) > 0 || statistics.count(0x90) > 0 ||
>             statistics.count(0x9d) > 0);
> }
> {noformat}
> I _think_ the files are actually https://en.wikipedia.org/wiki/Code_page_852,
> and they do have these characters. windows-1252 is _generally_ a better match
> for cp852 than ISO-8859-1.
> Not sure how best to handle this...
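
To illustrate why the check quoted in the description fires on Central European text, here is a minimal standalone sketch. It does not use Tika's API: the class name Windows1252GapCheck and the method hasByteUndefinedInWindows1252 are hypothetical, and the byte counting is a stand-in for whatever the statistics object does internally. The sample word is an assumption chosen because windows-1250 assigns 0x9d to the letter t-with-caron, one of the five byte values that windows-1252 leaves undefined.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Tika.
class Windows1252GapCheck {

    // Byte values that have no assigned character in windows-1252.
    private static final int[] UNDEFINED_IN_1252 = {0x81, 0x8d, 0x8f, 0x90, 0x9d};

    // Returns true if any byte in data falls on one of the windows-1252 gaps.
    public static boolean hasByteUndefinedInWindows1252(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            counts[b & 0xff]++; // mask to treat the byte as unsigned
        }
        for (int value : UNDEFINED_IN_1252) {
            if (counts[value] > 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Czech "stastny" with diacritics, written with Unicode escapes:
        // s-caron (U+0161), t-caron (U+0165), y-acute (U+00FD).
        String czech = "\u0161\u0165astn\u00fd";
        byte[] as1250 = czech.getBytes(Charset.forName("windows-1250"));
        byte[] ascii = "common words".getBytes(StandardCharsets.US_ASCII);

        // t-caron encodes to 0x9d in windows-1250, so the check fires.
        System.out.println(hasByteUndefinedInWindows1252(as1250)); // true
        System.out.println(hasByteUndefinedInWindows1252(ascii));  // false
    }
}
```

The sketch only shows that such bytes rule out windows-1252; it says nothing about which charset should be chosen instead, which is the open question in this issue.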