[jira] [Commented] (TIKA-3516) Unexpected charset IBM424_rtl detected for utf_8 file by CharsetDetector

Tim Allison (Jira) Fri, 06 Aug 2021 15:02:04 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394999#comment-17394999
 ]


Tim Allison commented on TIKA-3516:
-----------------------------------

Thank you for linking this to the earlier issue.  It isn't clear to me what the 
correct fix for this is: 
https://issues.apache.org/jira/browse/TIKA-2396?focusedCommentId=16051854&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16051854

Obviously, "get it right" is the desired outcome, but charset detection that 
relies on probabilistic models will sometimes be wrong, especially on short files.
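
Since a wrong guess can also be a name the JVM cannot resolve (IBM424_rtl in 
the report below), calling code may want to guard the lookup. The following is 
a minimal sketch using only the JDK, not Tika's own API; the class and method 
names (SafeCharsetLookup, resolveOrDefault) are hypothetical, and falling back 
to UTF-8 is an assumption about what suits the caller.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: turns a detected charset name into a
// Charset without letting an unknown name escape as an exception.
class SafeCharsetLookup {

    /** Returns the named charset if this JVM supports it, otherwise the fallback. */
    static Charset resolveOrDefault(String detectedName, Charset fallback) {
        if (detectedName == null) {
            return fallback; // Charset.isSupported(null) would throw
        }
        try {
            // isSupported returns false for well-formed names the JVM does not know
            return Charset.isSupported(detectedName)
                    ? Charset.forName(detectedName)
                    : fallback;
        } catch (IllegalCharsetNameException e) {
            // Syntactically illegal name, e.g. an empty string
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Assumption: UTF-8 is an acceptable default for the caller's data.
        System.out.println(resolveOrDefault("IBM424_rtl", StandardCharsets.UTF_8));
        System.out.println(resolveOrDefault("ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```

This only avoids the UnsupportedCharsetException; it does not make the 
detection itself any more accurate.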

If it would help for your use cases, you can change the order of the charset 
detectors.
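
One application-level way to get the effect of reordering is to try a strict 
UTF-8 decode first and only consult the statistical detector when that fails. 
The sketch below uses only the JDK and is not Tika's configuration mechanism; 
Utf8FirstDetection, isValidUtf8, and detect are hypothetical names, and the 
statistical detector is passed in as a plain function so the example stays 
self-contained.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical helper, not part of Tika: prefers UTF-8 whenever the bytes are
// valid UTF-8, and defers to another detector otherwise.
class Utf8FirstDetection {

    /** Returns true if the whole byte array decodes as UTF-8 without errors. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Tries UTF-8 first; falls back to the supplied detector on failure. */
    static Charset detect(byte[] bytes, Function<byte[], Charset> statisticalDetector) {
        return isValidUtf8(bytes)
                ? StandardCharsets.UTF_8
                : statisticalDetector.apply(bytes);
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        // Stub standing in for a real statistical detector (assumption).
        Function<byte[], Charset> stub = b -> StandardCharsets.ISO_8859_1;
        System.out.println(detect(utf8, stub));   // valid UTF-8, stub not consulted
        System.out.println(detect(latin1, stub)); // invalid UTF-8, stub decides
    }
}
```

The trade-off is that pure ASCII input, and the rare non-UTF-8 file whose 
bytes happen to form valid UTF-8, will be reported as UTF-8, and a file 
truncated in the middle of a multi-byte sequence will be handed to the 
fallback detector.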


> Unexpected charset IBM424_rtl detected for  utf_8  file by CharsetDetector
> --------------------------------------------------------------------------
>
>                 Key: TIKA-3516
>                 URL: https://issues.apache.org/jira/browse/TIKA-3516
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>            Reporter: Chaitra Rajappa
>            Priority: Major
>
> Hi,
>  The CharsetDetector incorrectly detects the charset of a file as IBM424_rtl, 
>  resulting in the exception 
> *_java.nio.charset.UnsupportedCharsetException: IBM424_rtl 17 at 
> java.nio.charset.Charset.forName(Charset.java:531)_*
> I see there is also an existing ticket for the same issue that has not been 
> fixed:
> https://issues.apache.org/jira/browse/TIKA-2396
>  Please suggest the changes to fix this. 
> Versions being used:
> apache-core - 1.20
> apache-parsers-1.20
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
