[jira] [Commented] (TIKA-1050) Charset detection gives wrong results for GB18030 encoding

Nick Burch (JIRA) Thu, 31 Jan 2013 16:13:14 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13568267#comment-13568267
 ]


Nick Burch commented on TIKA-1050:
----------------------------------

Charset detection generally works best if you give it a few KB of data to work 
on. It's entirely statistics-based (n-grams), and a very short snippet generally 
isn't representative.
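
To illustrate why the detector has to fall back on statistics, here is a 
minimal sketch using only the JDK (not Tika's CharsetDetector). It assumes the 
running JDK ships the GB18030 and IBM866 charsets, and the two-character sample 
and the class name ShortSampleAmbiguity are hypothetical, not taken from the 
attached file. IBM866 is a single-byte code page in which every byte value is 
assigned, so any GB18030 byte stream is also structurally valid IBM866, and 
only character frequencies can separate the two.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class ShortSampleAmbiguity {

    // Returns true if the whole sample is a valid byte sequence in the
    // named charset, i.e. strict decoding raises no error.
    public static boolean decodesCleanly(byte[] sample, String charsetName) {
        try {
            Charset.forName(charsetName)
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(sample));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical short sample: two Chinese characters encoded as GB18030.
        byte[] sample = "\u4f60\u597d".getBytes(Charset.forName("GB18030"));

        // Byte-level validity cannot rule out IBM866 here, so a detector
        // must rely on n-gram statistics, which need more data to be reliable.
        for (String name : new String[] {"GB18030", "IBM866", "UTF-8"}) {
            System.out.println(name + " decodes cleanly: "
                    + decodesCleanly(sample, name));
        }
    }
}
```

In this sketch GB18030 and IBM866 should both accept the sample while UTF-8 
rejects it. With only a handful of bytes there is very little frequency 
evidence to choose between the two survivors.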

Do you have the same problem with a slightly longer block of text? If so, any 
chance you could upload a new sample file of around 2-3 KB that we could use 
to test with?
                
> Charset detection gives wrong results for GB18030 encoding
> ----------------------------------------------------------
>
>                 Key: TIKA-1050
>                 URL: https://issues.apache.org/jira/browse/TIKA-1050
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2
>            Reporter: Amit Gupta
>            Priority: Critical
>         Attachments: Test data-GB.txt
>
>
> CharsetDetector gives IBM866 as the charset for text file that is in GB18030.
> GB18030 gets a lower confidence than IBM866.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
