[jira] [Commented] (TIKA-1262) parseToString fails to detect content-type / charset

Jukka Zitting (JIRA) Wed, 19 Mar 2014 19:54:07 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941336#comment-13941336
 ]


Jukka Zitting commented on TIKA-1262:
-------------------------------------

The {{CharsetDetector}} class assumes that the given input is already known to 
be plain text in some encoding, and only tries to work out which encoding that 
is. Unfortunately this assumption doesn't hold for methods like 
{{Tika.parseToString()}}, which also need to deal with binary file formats and 
so must first decide whether the input is text at all. That is why we currently 
can't auto-detect such documents.

What you could do here is pass the filename as input metadata to the parser, in 
which case it can assume from the file name that the file is plain text. The 
easiest way to do this is to use the {{TikaInputStream.get()}} factory method, 
which collects the input metadata for you, like this:

{code}
Metadata metadata = new Metadata();
TikaInputStream reader = TikaInputStream.get(new File(filepath), metadata);
{code}

(note the extra {{metadata}} argument)

More generally it would be possible to extend the existing {{TextStatistics}} 
class with information about the byte patterns used by the EUC-CN (and KOI8-R) 
encodings, ideally with character usage statistics like 
http://www.zein.se/patrick/3000char.html (and 
http://www.sttmedia.com/characterfrequency-russian) to make the heuristics more 
accurate. With such information the {{TextDetector}} class should be able to 
detect more encodings than just the ASCII-based ones and UTF-8 it currently 
knows about.
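
As a rough illustration of the byte-pattern part of that idea, here is a minimal 
sketch of a structural check for EUC-CN. The {{EucCnSketch}} class and its 
{{looksLikeEucCn()}} method are hypothetical names and not part of Tika, and the 
lead byte ranges assume plain GB2312 (rows 1-9 and 16-87) rather than the GBK 
or GB18030 extensions.

```java
public class EucCnSketch {

    /**
     * Returns true if the bytes are structurally consistent with EUC-CN
     * (GB2312) text: printable ASCII or common whitespace, plus at least
     * one well-formed two-byte sequence. Hypothetical helper, not Tika API.
     */
    public static boolean looksLikeEucCn(byte[] data) {
        int pairs = 0;
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            if (b < 0x80) {
                // ASCII range: control bytes other than tab, LF and CR
                // (and DEL) suggest binary content rather than text.
                boolean whitespace = b == '\t' || b == '\n' || b == '\r';
                if ((b < 0x20 && !whitespace) || b == 0x7F) {
                    return false;
                }
                i++;
            } else {
                // High byte: must start a complete two-byte sequence.
                if (!isLeadByte(b) || i + 1 >= data.length) {
                    return false;
                }
                int trail = data[i + 1] & 0xFF;
                if (trail < 0xA1 || trail > 0xFE) {
                    return false;
                }
                pairs++;
                i += 2;
            }
        }
        // Pure ASCII input is not evidence of EUC-CN.
        return pairs > 0;
    }

    // Assumption: GB2312 rows 1-9 (symbols) and 16-87 (hanzi), offset by 0xA0.
    private static boolean isLeadByte(int b) {
        return (b >= 0xA1 && b <= 0xA9) || (b >= 0xB0 && b <= 0xF7);
    }
}
```

A check like this only establishes that the bytes are structurally valid 
EUC-CN. KOI8-R text keeps its Cyrillic letters in the 0xC0-0xFF range, so a run 
of an even number of such letters can also pass, which is where the character 
usage statistics mentioned above would come in.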

> parseToString fails to detect content-type / charset
> ----------------------------------------------------
>
>                 Key: TIKA-1262
>                 URL: https://issues.apache.org/jira/browse/TIKA-1262
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.5
>         Environment: Java 1.7; Windows 7 64 bit
>            Reporter: Jeremy McLain
>         Attachments: ChineseTextExtraction.java, GB2312.txt, 
> russian-koi8-r.txt
>
>
> The code that demonstrates this bug can be found in attachment: 
> ChineseTextExtraction.java. 
> Observed behavior:
> Tika.parseToString(InputStream, Metadata) incorrectly detects 
> 'application/octet-stream' for the Content-Type and returns an empty string 
> for the contents.
> Expected behavior:
> It should detect 'text/plain' for the Content-Type and return a Unicode 
> string of the contents of the file.
> Notes:
> GB2312.txt is a plain text file containing some Chinese encoded with the 
> GB2312 charset. GB2312 is a very common charset and encoding. Tika should be 
> able to handle this without any problems. In fact, the CharsetDetector class 
> on its own accurately detects the charset as GB18030, which is a superset of 
> GB2312. CharsetDetector.getString() handles converting the GB2312 bytes to 
> Unicode just fine. I don't understand why the Tika facade fails.
> Edit:
> I have the same issue with the file russian-koi8-r.txt. koi8-r is also a 
> common charset. It appears that this isn't just a GB2312 issue. It seems to 
> work fine with ISO-8859-1 (English) files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
