[
https://issues.apache.org/jira/browse/TIKA-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941236#comment-13941236
]
Jeremy McLain edited comment on TIKA-1262 at 3/20/14 12:27 AM:
---------------------------------------------------------------
I have the same issue with the file russian-koi8-r.txt. KOI8-R is also a common
charset, so this does not appear to be specific to GB2312.
was (Author: gongchengshi):
I have the same issue with the file russian-koi8-r.txt. koi8-r is also a common
charset.
> parseToString fails to detect content-type / charset for GB2312 text file
> -------------------------------------------------------------------------
>
> Key: TIKA-1262
> URL: https://issues.apache.org/jira/browse/TIKA-1262
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.5
> Environment: Java 1.7; Windows 7 64 bit
> Reporter: Jeremy McLain
> Attachments: ChineseTextExtraction.java, GB2312.txt,
> russian-koi8-r.txt
>
>
> The code that demonstrates this bug can be found in attachment:
> ChineseTextExtraction.java.
> Observed behavior:
> Tika.parseToString(InputStream, Metadata) incorrectly detects
> 'application/octet-stream' for the Content-Type and returns an empty string
> for the contents.
> Expected behavior:
> It should detect 'text/plain' for the Content-Type and return a Unicode
> string of the contents of the file.
> Notes:
> GB2312.txt is a plain text file containing some Chinese encoded with the
> GB2312 charset. GB2312 is a very common charset and encoding. Tika should be
> able to handle this without any problems. In fact, the CharsetDetector class
> on its own accurately detects the charset as GB18030, which is a superset of
> GB2312. CharsetDetector.getString() handles converting the GB2312 bytes to
> Unicode just fine. I don't understand why the Tika facade fails.
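
The superset relationship mentioned in the notes can be checked without Tika.
What follows is a minimal sketch using only java.nio, assuming a JDK that ships
the GB2312 and GB18030 charsets (usual for a full JDK, but not guaranteed by
the Java specification). The class name and the four sample bytes are
illustrative and are not taken from the attachments.

```java
import java.nio.charset.Charset;

class Gb2312SupersetSketch {
    // Decode raw bytes with the named charset.
    // Charset.forName throws if the running JVM does not provide that charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // GB2312 encoding of the two characters U+4E2D U+6587 ("Chinese text").
        byte[] gb2312Bytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

        String viaGb2312 = decode(gb2312Bytes, "GB2312");
        String viaGb18030 = decode(gb2312Bytes, "GB18030");

        // Both decoders should yield the same string, because GB18030
        // keeps the GB2312 two-byte assignments unchanged.
        System.out.println(viaGb2312.equals(viaGb18030));
        System.out.println(viaGb18030.equals("\u4e2d\u6587"));
    }
}
```

This only illustrates why a GB18030 detection result is still usable for
GB2312 input. It does not exercise Tika's detector or the parseToString path
where the failure is reported.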