[
https://issues.apache.org/jira/browse/TIKA-3516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394999#comment-17394999
]
Tim Allison commented on TIKA-3516:
-----------------------------------
Thank you for linking this to the earlier issue. It isn't clear to me what the
correct fix for this is:
https://issues.apache.org/jira/browse/TIKA-2396?focusedCommentId=16051854&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16051854
Obviously, "get it right" is the desired outcome, but charset detection that
relies on probabilistic models will sometimes be wrong, especially on short
files.
You can change the order of the charset detectors, if that would help for your
use case.
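Independent of detector ordering, the calling code can guard against the
UnsupportedCharsetException itself. The following is a minimal sketch using
only the JDK, not a Tika API: the class SafeCharset and its methods are
hypothetical names, and detectedName is assumed to be whatever charset name
the detector reported. It prefers UTF-8 when the bytes decode as strict UTF-8,
and otherwise uses the detected charset only if the JVM supports it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika.
final class SafeCharset {

    // True if the bytes decode as strict UTF-8 (no malformed sequences).
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Picks a charset that Charset.forName can actually return.
    // detectedName: name reported by the detector (assumption: may be
    // null or unknown to this JVM, such as "IBM424_rtl").
    static Charset resolve(String detectedName, byte[] bytes, Charset fallback) {
        if (isValidUtf8(bytes)) {
            return StandardCharsets.UTF_8;
        }
        try {
            if (detectedName != null && Charset.isSupported(detectedName)) {
                return Charset.forName(detectedName);
            }
        } catch (IllegalCharsetNameException e) {
            // Syntactically illegal name: fall through to the fallback.
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9}; // U+00E9 encoded as UTF-8
        System.out.println(resolve("IBM424_rtl", utf8, StandardCharsets.ISO_8859_1));
    }
}
```

The strict UTF-8 check is a design choice for this use case: pure ASCII and
well-formed UTF-8 input never reaches the probabilistic result, at the cost of
occasionally labelling a non-UTF-8 file whose bytes happen to form valid UTF-8
sequences.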
> Unexpected charset IBM424_rtl detected for utf_8 file by CharsetDetector
> --------------------------------------------------------------------------
>
> Key: TIKA-3516
> URL: https://issues.apache.org/jira/browse/TIKA-3516
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Reporter: Chaitra Rajappa
> Priority: Major
>
> Hi,
> The CharsetDetector wrongly detects the charset of a UTF-8 file as IBM424_rtl,
> resulting in this exception:
> *_java.nio.charset.UnsupportedCharsetException: IBM424_rtl 17 at
> java.nio.charset.Charset.forName(Charset.java:531)_*
> I see there is also an existing ticket for the same issue that has not been
> fixed:
> https://issues.apache.org/jira/browse/TIKA-2396
> Please suggest the changes to fix this.
> Versions being used:
> apache-core - 1.20
> apache-parsers-1.20
> Thanks