[jira] [Updated] (TIKA-1262) parseToString fails to detect content-type / charset for GB2312 text file

Jeremy McLain (JIRA) Wed, 19 Mar 2014 16:51:54 -0700

     [ https://issues.apache.org/jira/browse/TIKA-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jeremy McLain updated TIKA-1262:
--------------------------------

    Description: 
The code that demonstrates this bug is in the attachment 
ChineseTextExtraction.java.

Observed behavior:
Tika incorrectly detects 'application/octet-stream' for the Content-Type and 
returns an empty string for the contents.

Expected behavior:
It should detect 'text/plain' for the Content-Type and return a Unicode string 
of the contents of the file.

Notes:
GB2312.txt is a plain text file containing some Chinese text encoded in GB2312, 
a very common charset. Tika should be able to handle it without any problems. 
In fact, the CharsetDetector class on its own detects the charset as GB18030, 
which is a superset of GB2312, and CharsetDetector.getString() converts the 
GB2312 bytes to Unicode just fine. I don't understand why the Tika facade fails.
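
As a sanity check on the superset point above, here is a minimal sketch that 
uses only the JDK (no Tika). It encodes a short Chinese string as GB2312 and 
decodes the same bytes as GB18030. The class name, method name, and sample 
string are made up for illustration and are not taken from the attachment. It 
also assumes the JDK in use ships the GB2312 and GB18030 charsets.

```java
import java.nio.charset.Charset;

public class Gb2312RoundTrip {
    // Encodes the text as GB2312, then decodes those same bytes as GB18030.
    // If GB18030 really covers GB2312, the result equals the input.
    public static String decodeAsGb18030(String text) {
        byte[] gb2312Bytes = text.getBytes(Charset.forName("GB2312"));
        return new String(gb2312Bytes, Charset.forName("GB18030"));
    }

    public static void main(String[] args) {
        // Hypothetical sample text: the two characters for "Chinese (language)".
        String sample = "\u4e2d\u6587";
        String decoded = decodeAsGb18030(sample);
        System.out.println(sample.equals(decoded));
    }
}
```

If the round trip preserves the text, a GB18030 detection result is usable for 
GB2312 input, which suggests the failure is more likely in how the facade 
picks the Content-Type than in the charset handling itself.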

  was:
The code that demonstrates this bug can be found in attachment: 
ChineseTextExtraction.java

This code errantly detects 'application/octet-stream' for the Content-Type and 
returns an empty string for the contents. It should detect 'text/plain' for the 
Content-Type and return a Unicode string of the contents of the file.

GB2312 is a very common charset and encoding.


> parseToString fails to detect content-type / charset for GB2312 text file
> -------------------------------------------------------------------------
>
>                 Key: TIKA-1262
>                 URL: https://issues.apache.org/jira/browse/TIKA-1262
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.5
>         Environment: Java 1.7; Windows 7 64 bit
>            Reporter: Jeremy McLain
>         Attachments: ChineseTextExtraction.java, GB2312.txt
>
>
> The code that demonstrates this bug is in the attachment 
> ChineseTextExtraction.java.
>
> Observed behavior:
> Tika incorrectly detects 'application/octet-stream' for the Content-Type and 
> returns an empty string for the contents.
>
> Expected behavior:
> It should detect 'text/plain' for the Content-Type and return a Unicode 
> string of the contents of the file.
>
> Notes:
> GB2312.txt is a plain text file containing some Chinese text encoded in 
> GB2312, a very common charset. Tika should be able to handle it without any 
> problems. In fact, the CharsetDetector class on its own detects the charset 
> as GB18030, which is a superset of GB2312, and CharsetDetector.getString() 
> converts the GB2312 bytes to Unicode just fine. I don't understand why the 
> Tika facade fails.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
