[ https://issues.apache.org/jira/browse/TIKA-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy McLain updated TIKA-1262:
--------------------------------

    Description: 
The code that demonstrates this bug can be found in attachment: 
ChineseTextExtraction.java. 

Observed behavior:
Tika.parseToString(InputStream, Metadata) incorrectly detects 
'application/octet-stream' for the Content-Type and returns an empty string for 
the contents.

Expected behavior:
It should detect 'text/plain' for the Content-Type and return a Unicode string 
of the contents of the file.

Notes:
GB2312.txt is a plain text file containing Chinese text encoded in GB2312, a 
very common charset that Tika should handle without any problems. In fact, the 
CharsetDetector class on its own detects the charset as GB18030, which is a 
superset of GB2312, and CharsetDetector.getString() converts the GB2312 bytes 
to Unicode just fine. I don't understand why the Tika facade fails.
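
The superset relationship mentioned in the notes can be checked without Tika. 
The following is a minimal sketch using only the JDK's java.nio.charset 
classes. The class name, method name, and sample string are illustrative and 
are not taken from ChineseTextExtraction.java. It assumes the running JDK 
provides both the GB2312 and GB18030 charsets.

```java
import java.nio.charset.Charset;

public class Gb2312SupersetCheck {
    // Encode the text as GB2312, then decode the resulting bytes as GB18030.
    // If GB18030 is a superset of GB2312, the text should survive unchanged.
    static String roundTrip(String text) {
        byte[] gb2312Bytes = text.getBytes(Charset.forName("GB2312"));
        return new String(gb2312Bytes, Charset.forName("GB18030"));
    }

    public static void main(String[] args) {
        // Hypothetical sample: four common Chinese characters ("中文文本").
        String sample = "\u4e2d\u6587\u6587\u672c";
        System.out.println(sample.equals(roundTrip(sample)));
    }
}
```

This only shows that decoding as GB18030 is safe for GB2312 input. It says 
nothing about why the Tika facade reports 'application/octet-stream'.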

Edit:
I have the same issue with the file russian-koi8-r.txt. KOI8-R is also a common 
charset, so this does not appear to be limited to GB2312. ISO-8859-1 (English) 
files seem to work fine.
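
One observable difference between the failing and working inputs is how many 
bytes have the high bit set. The sketch below measures that with JDK classes 
only. It is an illustration of how the inputs differ, not a diagnosis of 
Tika's behavior. The class name, method name, and sample strings are 
hypothetical and do not come from the attached files.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HighByteRatio {
    // Fraction of bytes whose high bit is set (values 0x80 through 0xFF).
    static double highByteRatio(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int high = 0;
        for (byte b : data) {
            if ((b & 0x80) != 0) {
                high++;
            }
        }
        return (double) high / data.length;
    }

    public static void main(String[] args) {
        // Hypothetical samples standing in for the three kinds of file.
        byte[] english = "plain text".getBytes(StandardCharsets.ISO_8859_1);
        byte[] russian = "\u0442\u0435\u043a\u0441\u0442".getBytes(Charset.forName("KOI8-R"));
        byte[] chinese = "\u4e2d\u6587".getBytes(Charset.forName("GB2312"));

        System.out.println("ISO-8859-1 English: " + highByteRatio(english));
        System.out.println("KOI8-R Russian:     " + highByteRatio(russian));
        System.out.println("GB2312 Chinese:     " + highByteRatio(chinese));
    }
}
```

English text in ISO-8859-1 stays within the ASCII range, while Cyrillic 
letters in KOI8-R and Chinese characters in GB2312 are encoded entirely with 
high-bit bytes.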

  was:
The code that demonstrates this bug can be found in attachment: 
ChineseTextExtraction.java. 

Observed behavior:
Tika.parseToString(InputStream, Metadata) incorrectly detects 
'application/octet-stream' for the Content-Type and returns an empty string for 
the contents.

Expected behavior:
It should detect 'text/plain' for the Content-Type and return a Unicode string 
of the contents of the file.

Notes:
GB2312.txt is a plain text file containing Chinese text encoded in GB2312, a 
very common charset that Tika should handle without any problems. In fact, the 
CharsetDetector class on its own detects the charset as GB18030, which is a 
superset of GB2312, and CharsetDetector.getString() converts the GB2312 bytes 
to Unicode just fine. I don't understand why the Tika facade fails.


> parseToString fails to detect content-type / charset for GB2312 text file
> -------------------------------------------------------------------------
>
>                 Key: TIKA-1262
>                 URL: https://issues.apache.org/jira/browse/TIKA-1262
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.5
>         Environment: Java 1.7; Windows 7 64 bit
>            Reporter: Jeremy McLain
>         Attachments: ChineseTextExtraction.java, GB2312.txt, 
> russian-koi8-r.txt
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)
