[ https://issues.apache.org/jira/browse/TIKA-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeremy McLain closed TIKA-1262.
-------------------------------
Resolution: Not A Problem
Fix Version/s: 1.5
See comment by Jukka Zitting.
> parseToString fails to detect content-type / charset
> ----------------------------------------------------
>
> Key: TIKA-1262
> URL: https://issues.apache.org/jira/browse/TIKA-1262
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.5
> Environment: Java 1.7; Windows 7 64 bit
> Reporter: Jeremy McLain
> Fix For: 1.5
>
> Attachments: ChineseTextExtraction.java, GB2312.txt,
> russian-koi8-r.txt
>
>
> The code that demonstrates this bug can be found in attachment:
> ChineseTextExtraction.java.
> Observed behavior:
> Tika.parseToString(InputStream, Metadata) incorrectly detects
> 'application/octet-stream' for the Content-Type and returns an empty string
> for the contents.
> Expected behavior:
> It should detect 'text/plain' for the Content-Type and return a Unicode
> string of the contents of the file.
> Notes:
> GB2312.txt is a plain-text file containing Chinese text encoded in the
> GB2312 charset, which is very common, so Tika should be able to handle it
> without problems. In fact, the CharsetDetector class on its own detects the
> charset as GB18030, which is a superset of GB2312, and
> CharsetDetector.getString() converts the GB2312 bytes to Unicode just fine.
> I don't understand why the Tika facade fails.
> Edit:
> I have the same issue with the file russian-koi8-r.txt. KOI8-R is also a
> common charset, so this isn't just a GB2312 issue. It seems to work fine
> with ISO-8859-1 (English) files.
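
For readers following the charset remarks in the description, here is a
minimal sketch using only the JDK (no Tika classes) of why a detector
reporting GB18030 for a GB2312 file is still a usable answer: bytes produced
by the GB2312 encoder decode to the same text under GB18030. The class name,
the helper roundTrip, and the sample strings are hypothetical and are not
taken from ChineseTextExtraction.java; the sketch also assumes a JDK that
ships the extended charsets (GB2312, GB18030, KOI8-R).

```java
import java.nio.charset.Charset;

public class CharsetSupersetSketch {

    // Hypothetical helper: encode text with one charset, then decode the
    // resulting bytes with another charset.
    public static String roundTrip(String text, String encodeWith, String decodeWith) {
        byte[] bytes = text.getBytes(Charset.forName(encodeWith));
        return new String(bytes, Charset.forName(decodeWith));
    }

    public static void main(String[] args) {
        // Illustrative sample strings, not the contents of the attached files.
        String chinese = "\u4e2d\u6587";                               // Chinese sample
        String russian = "\u0440\u0443\u0441\u0441\u043a\u0438\u0439"; // Russian sample

        // GB2312 bytes decoded as GB18030 give back the original text.
        System.out.println(chinese.equals(roundTrip(chinese, "GB2312", "GB18030")));

        // KOI8-R bytes decoded as KOI8-R give back the original text.
        System.out.println(russian.equals(roundTrip(russian, "KOI8-R", "KOI8-R")));

        // Decoding the same GB2312 bytes as ISO-8859-1 does not recover the text.
        System.out.println(chinese.equals(roundTrip(chinese, "GB2312", "ISO-8859-1")));
    }
}
```

This only illustrates the decoding step; it says nothing about how the Tika
facade chooses a Content-Type, which is the subject of the comment referenced
above.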