[ 
https://issues.apache.org/jira/browse/TIKA-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051854#comment-16051854
 ] 

Tim Allison commented on TIKA-2396:
-----------------------------------

Encoding detection on short files is notoriously challenging.  With the 
exception of some HTML meta-header reading code, Tika relies on other libraries 
(well, copied/pasted other libraries) for encoding detection.

I'm not sure there is much we can do about this.  You can configure a different 
order of charset detectors if that would help.
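
To illustrate why the order can matter on short input, here is a minimal 
JDK-only sketch. It is not Tika's detector API: the class 
OrderedCharsetGuess, the helper firstStrictMatch, and the three sample bytes 
are all hypothetical, and a strict decode is only a crude stand-in for a real 
statistical detector. It assumes the JDK in use ships the ISO-8859-9 charset. 
The same bytes decode cleanly under more than one single-byte charset, so 
whichever candidate is tried first wins.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class OrderedCharsetGuess {

    // Hypothetical helper (not part of Tika): returns the first charset in
    // `order` that decodes `data` without malformed or unmappable input.
    static Optional<Charset> firstStrictMatch(byte[] data, List<String> order) {
        for (String name : order) {
            Charset cs = Charset.forName(name);
            try {
                cs.newDecoder()
                  .onMalformedInput(CodingErrorAction.REPORT)
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .decode(ByteBuffer.wrap(data));
                return Optional.of(cs);
            } catch (CharacterCodingException e) {
                // This candidate rejected the bytes; fall through to the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample: 0xFC is "ü" in both ISO-8859-9 and ISO-8859-1,
        // and is not a valid byte in UTF-8.
        byte[] sample = {0x67, (byte) 0xFC, 0x6E};

        // Same bytes, different candidate order, different answer.
        System.out.println(firstStrictMatch(sample, Arrays.asList("UTF-8", "ISO-8859-9")));
        System.out.println(firstStrictMatch(sample, Arrays.asList("UTF-8", "ISO-8859-1")));
    }
}
```

With only a handful of bytes there is nothing to separate the two Latin 
candidates, which is the same ambiguity a confidence-based detector faces on 
a short file.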

Aside from "get it right", what behavior would you like to see?  What can we 
improve?

> Unexpected charset detected for a plain text file by CharsetDetector
> --------------------------------------------------------------------
>
>                 Key: TIKA-2396
>                 URL: https://issues.apache.org/jira/browse/TIKA-2396
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>            Reporter: Gaurav Gupta
>         Attachments: test_Asset.txt
>
>
> Hi,
> The CharsetDetector seems to be incorrectly detecting IBM424_rtl charset with 
> maximum probability for the text file attached - [^test_Asset.txt] . 
> ISO-8859-9 has the second-best confidence value, which ideally should have 
> been first in the list.
> Versions being used:
> apache-core - 1.14.0
> apache-parsers-1.14.0
> Thanks



