[
https://issues.apache.org/jira/browse/TIKA-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeremy McLain updated TIKA-1262:
--------------------------------
Description:
The code that demonstrates this bug is in the attachment
ChineseTextExtraction.java.
Observed behavior:
Tika incorrectly detects 'application/octet-stream' for the Content-Type and
returns an empty string for the contents.
Expected behavior:
It should detect 'text/plain' for the Content-Type and return a Unicode string
of the contents of the file.
Notes:
GB2312.txt is a plain text file containing Chinese text encoded with the GB2312
charset. GB2312 is a very common encoding, so Tika should be able to handle it
without any problems. In fact, the CharsetDetector class on its own detects the
charset as GB18030, which is a superset of GB2312, and
CharsetDetector.getString() converts the GB2312 bytes to Unicode just fine. I
don't understand why the Tika facade fails.
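To illustrate why a GB18030 detection result is acceptable for a GB2312 file,
here is a minimal JDK-only sketch. It does not use Tika and is not the attached
ChineseTextExtraction.java. The class name, helper method, and sample string
are made up for illustration, and it assumes a JDK that ships both charsets.

```java
import java.nio.charset.Charset;

class Gb2312SupersetDemo {
    // Both charsets are looked up by name at runtime (assumes a full JDK).
    static final Charset GB2312 = Charset.forName("GB2312");
    static final Charset GB18030 = Charset.forName("GB18030");

    // Encode text as GB2312, then decode those same bytes as GB18030.
    static String roundTrip(String text) {
        byte[] gb2312Bytes = text.getBytes(GB2312);
        return new String(gb2312Bytes, GB18030);
    }

    public static void main(String[] args) {
        // Hypothetical sample: the two characters of "zhong wen" (Chinese),
        // written with escapes so the source file encoding does not matter.
        String sample = "\u4e2d\u6587";
        String decoded = roundTrip(sample);
        System.out.println("Round trip preserved text: " + sample.equals(decoded));
    }
}
```

If the round trip preserves the text, decoding as GB18030 is safe for GB2312
input, which is consistent with CharsetDetector working here and suggests the
problem lies in the facade's type detection rather than in charset handling.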
was:
The code that demonstrates this bug can be found in attachment:
ChineseTextExtraction.java
This code errantly detects 'application/octet-stream' for the Content-Type and
returns an empty string for the contents. It should detect 'text/plain' for the
Content-Type and return a Unicode string of the contents of the file.
GB2312 is a very common charset and encoding.
> parseToString fails to detect content-type / charset for GB2312 text file
> -------------------------------------------------------------------------
>
> Key: TIKA-1262
> URL: https://issues.apache.org/jira/browse/TIKA-1262
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.5
> Environment: Java 1.7; Windows 7 64 bit
> Reporter: Jeremy McLain
> Attachments: ChineseTextExtraction.java, GB2312.txt
>
>