[ 
https://issues.apache.org/jira/browse/TIKA-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy McLain closed TIKA-1262.
-------------------------------

       Resolution: Not A Problem
    Fix Version/s: 1.5

See comment by Jukka Zitting.

> parseToString fails to detect content-type / charset
> ----------------------------------------------------
>
>                 Key: TIKA-1262
>                 URL: https://issues.apache.org/jira/browse/TIKA-1262
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.5
>         Environment: Java 1.7; Windows 7 64 bit
>            Reporter: Jeremy McLain
>             Fix For: 1.5
>
>         Attachments: ChineseTextExtraction.java, GB2312.txt, 
> russian-koi8-r.txt
>
>
> The code that demonstrates this bug is in the attachment
> ChineseTextExtraction.java.
> Observed behavior:
> Tika.parseToString(InputStream, Metadata) incorrectly detects 
> 'application/octet-stream' for the Content-Type and returns an empty string 
> for the contents.
> Expected behavior:
> It should detect 'text/plain' for the Content-Type and return a Unicode 
> string of the contents of the file.
> Notes:
> GB2312.txt is a plain text file containing some Chinese encoded with the 
> GB2312 charset. GB2312 is a very common charset and encoding. Tika should be 
> able to handle this without any problems. In fact, the CharsetDetector class 
> on its own accurately detects the charset as GB18030, which is a superset of
> GB2312. CharsetDetector.getString() handles converting the GB2312 bytes to 
> Unicode just fine. I don't understand why the Tika facade fails.
> Edit:
> I have the same issue with the file russian-koi8-r.txt. KOI8-R is also a
> common charset, so this does not appear to be just a GB2312 issue. It seems
> to work fine with ISO-8859-1 (English) files.
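
The report's note that GB18030 is a superset of GB2312 can be checked with
the JDK alone, without Tika. The following is a minimal sketch, not the
code from ChineseTextExtraction.java: the class name Gb2312RoundTrip and
the helper decode are hypothetical, and it assumes the running JVM ships
the GB2312 and GB18030 charsets (they are part of the extended charsets in
standard JDK builds).

```java
import java.nio.charset.Charset;

public class Gb2312RoundTrip {
    // Hypothetical helper: decode raw bytes with an explicitly named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes to avoid source-encoding issues.
        String original = "\u4e2d\u6587";

        // Encode as GB2312, as the attached GB2312.txt is described to be.
        byte[] gb2312Bytes = original.getBytes(Charset.forName("GB2312"));

        // Decoding the same bytes as GB18030 should give back the original text,
        // because GB18030 keeps the GB2312 two-byte assignments.
        System.out.println(decode(gb2312Bytes, "GB18030").equals(original));
        System.out.println(decode(gb2312Bytes, "GB2312").equals(original));
    }
}
```

This only illustrates why a detector answer of GB18030 is usable for a
GB2312 file. It does not reproduce the parseToString behavior described in
the report.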



--
This message was sent by Atlassian JIRA
(v6.2#6252)
