[jira] [Commented] (TIKA-3479) UniversalCharsetDetector in 2.x is misidentifying windows-1250 as ISO-8859-1

Tim Allison (Jira) Wed, 14 Jul 2021 09:17:04 -0700


    [ https://issues.apache.org/jira/browse/TIKA-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380719#comment-17380719 ]


Tim Allison commented on TIKA-3479:
-----------------------------------

Experimenting with changing this behavior reveals too many unexpected side 
effects.  Let's push this off until after the 2.x release.  As stated, 
detection is broken in both 1.x and 2.x; it is only slightly worse in 2.x for 
windows-1250.

> UniversalCharsetDetector in 2.x is misidentifying windows-1250 as ISO-8859-1
> ----------------------------------------------------------------------------
>
>                 Key: TIKA-3479
>                 URL: https://issues.apache.org/jira/browse/TIKA-3479
>             Project: Tika
>          Issue Type: Task
>    Affects Versions: 2.0.0-BETA
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: Bates.Motel.S02E08.HDTV.x264-KILLERS.srt
>
>
> We've lost quite a few "common words" for Czech and Slovak text files in 2.x 
> vs. 1.x.  The key issue appears to be the following (which we do not have in 
> 1.x).
> {noformat}
>     /*
>      * Hex values 0x81, 0x8d, 0x8f, 0x90 and 0x9d don't exist in charset windows-1252.
>      * If any of these values has a count > 0, return true.
>      * */
>     private Boolean hasNonexistentHexInCharsetWindows1252() {
>         return (statistics.count(0x81) > 0 || statistics.count(0x8d) > 0 ||
>                 statistics.count(0x8f) > 0 || statistics.count(0x90) > 0 ||
>                 statistics.count(0x9d) > 0);
>     }
> {noformat}
> The icu4j detector detects windows-1250 (not supported by the 
> UniversalEncodingDetector), and the characters decoded with that encoding 
> get better results on Google. windows-1252 is _generally_ a better match for 
> windows-1250 than ISO-8859-1 is.
> Not sure how best to handle this...
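
For illustration only, here is a minimal standalone sketch of why a check like the one quoted above fires on Czech and Slovak text. It is not Tika code: the class and method names are hypothetical, and it assumes the JDK in use provides the windows-1250 charset. Three of the byte values that windows-1252 leaves unassigned are ordinary letters in windows-1250 (0x8d is Ť, 0x8f is Ź, 0x9d is ť), so legitimate windows-1250 input can easily contain them.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Tika or juniversalchardet.
final class Win1252GapCheck {

    // Byte values that windows-1252 leaves unassigned.
    private static final int[] UNASSIGNED_IN_1252 = {0x81, 0x8d, 0x8f, 0x90, 0x9d};

    // Mirrors the idea of the quoted check: true if any unassigned value occurs.
    static boolean hasBytesUnassignedInWindows1252(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            counts[b & 0xff]++; // mask to treat the byte as unsigned
        }
        for (int value : UNASSIGNED_IN_1252) {
            if (counts[value] > 0) {
                return true;
            }
        }
        return false;
    }

    // Decodes the same bytes under a named charset for side-by-side comparison.
    static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The Czech word "šťastný" encoded as windows-1250; 0x9d is the letter ť.
        byte[] word = {(byte) 0x9a, (byte) 0x9d, 'a', 's', 't', 'n', (byte) 0xfd};

        System.out.println("triggers check: " + hasBytesUnassignedInWindows1252(word));
        System.out.println("windows-1250:   " + decode(word, "windows-1250"));
        // ISO-8859-1 maps 0x9a and 0x9d to C1 control characters, not letters.
        System.out.println("ISO-8859-1:     " + decode(word, "ISO-8859-1"));
    }
}
```

A single ť is enough to trip the check, which is consistent with the loss of common words reported above for Czech and Slovak files.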



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
