[jira] [Comment Edited] (TIKA-1262) parseToString fails to detect content-type / charset for GB2312 text file

Jeremy McLain (JIRA) Wed, 19 Mar 2014 17:30:06 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941236#comment-13941236
 ]


Jeremy McLain edited comment on TIKA-1262 at 3/20/14 12:27 AM:
---------------------------------------------------------------

I have the same issue with the file russian-koi8-r.txt. KOI8-R is also a common 
charset, so it appears that this isn't just a GB2312 issue.


was (Author: gongchengshi):
I have the same issue with the file russian-koi8-r.txt. koi8-r is also a common 
charset.

> parseToString fails to detect content-type / charset for GB2312 text file
> -------------------------------------------------------------------------
>
>                 Key: TIKA-1262
>                 URL: https://issues.apache.org/jira/browse/TIKA-1262
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.5
>         Environment: Java 1.7; Windows 7 64 bit
>            Reporter: Jeremy McLain
>         Attachments: ChineseTextExtraction.java, GB2312.txt, 
> russian-koi8-r.txt
>
>
> The code that demonstrates this bug is in the attachment 
> ChineseTextExtraction.java.
> Observed behavior:
> Tika.parseToString(InputStream, Metadata) incorrectly detects 
> 'application/octet-stream' for the Content-Type and returns an empty string 
> for the contents.
> Expected behavior:
> It should detect 'text/plain' for the Content-Type and return a Unicode 
> string of the contents of the file.
> Notes:
> GB2312.txt is a plain text file containing Chinese text encoded with the 
> GB2312 charset. GB2312 is a very common charset and encoding, so Tika should 
> be able to handle it without any problems. In fact, the CharsetDetector 
> class on its own accurately detects the charset as GB18030, which is a 
> superset of GB2312, and CharsetDetector.getString() converts the GB2312 
> bytes to Unicode just fine. I don't understand why the Tika facade fails.
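
The note about GB18030 being a superset of GB2312 can be checked with the JDK
alone, independent of Tika. The following is only a minimal sketch under
stated assumptions: the class name, method name, and sample string are
hypothetical and are not taken from ChineseTextExtraction.java, and it assumes
a JDK that ships both the GB2312 and GB18030 charsets. It encodes a short
string as GB2312 bytes and decodes those same bytes as GB18030.

```java
import java.nio.charset.Charset;

// Hypothetical illustration class; not part of the attached reproduction code.
class Gb2312SupersetSketch {

    // Encode the text as GB2312 bytes, then decode the same bytes as GB18030.
    // If GB18030 is a superset of GB2312, the text should survive unchanged.
    static String decodeGb2312AsGb18030(String text) {
        byte[] gb2312Bytes = text.getBytes(Charset.forName("GB2312"));
        return new String(gb2312Bytes, Charset.forName("GB18030"));
    }

    public static void main(String[] args) {
        // Assumed sample: four common simplified Chinese characters.
        String sample = "\u4e2d\u6587\u6d4b\u8bd5";
        String decoded = decodeGb2312AsGb18030(sample);
        System.out.println("round trip preserved text: " + sample.equals(decoded));
    }
}
```

This only shows why a GB18030 result for a GB2312 file is a usable detection;
it does not exercise the Tika facade or explain why parseToString falls back
to 'application/octet-stream'.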



--
This message was sent by Atlassian JIRA
(v6.2#6252)
