[
https://issues.apache.org/jira/browse/TIKA-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051854#comment-16051854
]
Tim Allison commented on TIKA-2396:
-----------------------------------
Encoding detection on short files is notoriously challenging. With the
exception of some HTML meta-header reading code, Tika relies on other libraries
(well, copied/pasted other libraries) for encoding detection.
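To illustrate why short inputs are hard, here is a minimal sketch using only JDK classes (not Tika's detectors). It strictly decodes a few bytes under several candidate charsets and reports which ones accept them without error. The class and method names are hypothetical; the point is only that a handful of high bytes is usually valid in many single-byte encodings at once, so a detector has to fall back on statistics that are weak for short text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

public class AmbiguousBytesSketch {

    /** Returns the candidate charset names under which data decodes without any error. */
    static List<String> plausibleCharsets(byte[] data, String... candidates) {
        List<String> accepted = new ArrayList<>();
        for (String name : candidates) {
            if (!Charset.isSupported(name)) {
                continue; // this JVM does not ship the charset; skip it
            }
            try {
                // Strict decoding: fail on malformed or unmappable bytes
                // instead of silently substituting a replacement character.
                Charset.forName(name).newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(data));
                accepted.add(name);
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset, so it is ruled out.
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Three high bytes, as might appear in a very short text file.
        byte[] data = {(byte) 0xE7, (byte) 0xF6, (byte) 0xFC};
        System.out.println(plausibleCharsets(
                data, "UTF-8", "ISO-8859-1", "ISO-8859-9", "IBM424"));
    }
}
```

Strict validity only rules charsets out. Every charset left in the list is still a candidate, and choosing among them is the part that needs more text than a short file provides.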
I'm not sure there is much we can do about this. You can configure a different
order of charset detectors if that would help.
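As a rough sketch of why detector order matters, the following self-contained Java example tries a list of detectors in sequence and returns the first non-null answer. It assumes a "first detector with an opinion wins" policy. The `SimpleDetector` interface and the three toy detectors are hypothetical stand-ins, not Tika's actual classes or configuration mechanism.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class DetectorOrderSketch {

    /** Hypothetical stand-in for a charset detector; returns null when it has no opinion. */
    interface SimpleDetector {
        Charset detect(byte[] data);
    }

    /** Recognizes a UTF-8 byte order mark (EF BB BF) and nothing else. */
    static final SimpleDetector BOM_DETECTOR = data ->
            data.length >= 3
                    && (data[0] & 0xFF) == 0xEF
                    && (data[1] & 0xFF) == 0xBB
                    && (data[2] & 0xFF) == 0xBF
                    ? StandardCharsets.UTF_8 : null;

    /** Claims US-ASCII only when every byte is below 0x80. */
    static final SimpleDetector ASCII_DETECTOR = data -> {
        for (byte b : data) {
            if ((b & 0x80) != 0) {
                return null;
            }
        }
        return StandardCharsets.US_ASCII;
    };

    /** Always answers, so it behaves like a last-resort guess (an assumed default). */
    static final SimpleDetector FALLBACK_DETECTOR = data -> StandardCharsets.ISO_8859_1;

    /** Tries each detector in list order; the first non-null result wins. */
    static Charset detect(List<SimpleDetector> ordered, byte[] data) {
        for (SimpleDetector detector : ordered) {
            Charset charset = detector.detect(data);
            if (charset != null) {
                return charset;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] sample = "short text".getBytes(StandardCharsets.US_ASCII);
        // Specific detectors first: prints US-ASCII
        System.out.println(detect(
                Arrays.asList(BOM_DETECTOR, ASCII_DETECTOR, FALLBACK_DETECTOR), sample));
        // Fallback first: prints ISO-8859-1, because later detectors are never consulted
        System.out.println(detect(
                Arrays.asList(FALLBACK_DETECTOR, ASCII_DETECTOR), sample));
    }
}
```

The same bytes yield different answers depending only on the order, which is why reordering can help for a known class of input but cannot make an individual detector more accurate.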
Aside from "get it right", what behavior would you like to see? What can we
improve?
> Unexpected charset detected for a plain text file by CharsetDetector
> --------------------------------------------------------------------
>
> Key: TIKA-2396
> URL: https://issues.apache.org/jira/browse/TIKA-2396
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Reporter: Gaurav Gupta
> Attachments: test_Asset.txt
>
>
> Hi,
> The CharsetDetector seems to be incorrectly detecting the IBM424_rtl charset with
> maximum probability for the attached text file, [^test_Asset.txt].
> ISO-8859-9 has the second-best confidence value, but ideally it should have
> been first in the list.
> Versions being used:
> apache-core - 1.14.0
> apache-parsers-1.14.0
> Thanks