[jira] [Resolved] (TIKA-688) Enhance content-type detector to recognize almost plain text

Jukka Zitting (JIRA) Sat, 17 Sep 2011 04:53:35 -0700

     [ 
https://issues.apache.org/jira/browse/TIKA-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jukka Zitting resolved TIKA-688.
--------------------------------

    Resolution: Fixed
      Assignee: Jukka Zitting

Implemented in revision 1171952 by allowing text content that contains up to 2% 
control characters and up to 10% non-ASCII characters.
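
The thresholds above can be illustrated with a minimal sketch. This is not the
code committed in revision 1171952; the class and method names are hypothetical.
It also assumes that tab, LF and CR count as ordinary text, that "up to" is
inclusive, that the check runs over raw bytes of a sample, and that an empty
sample counts as text.

```java
/** Hypothetical helper for illustration only; not Tika's actual class or API. */
final class AlmostPlainText {
    // Thresholds taken from the resolution note; treating them as inclusive is an assumption.
    static final int MAX_CONTROL_PERCENT = 2;
    static final int MAX_NON_ASCII_PERCENT = 10;

    private AlmostPlainText() {}

    /** Assumption: tab, LF and CR are ordinary text; other bytes below 0x20, and 0x7F, are control. */
    static boolean isControl(int b) {
        return (b < 0x20 && b != '\t' && b != '\n' && b != '\r') || b == 0x7F;
    }

    /** Returns true if the sample stays within both percentage thresholds. */
    static boolean looksLikeText(byte[] sample) {
        if (sample.length == 0) {
            return true; // assumption: nothing in an empty sample contradicts "text"
        }
        int control = 0;
        int nonAscii = 0;
        for (byte raw : sample) {
            int b = raw & 0xFF; // read the byte as an unsigned value
            if (b >= 0x80) {
                nonAscii++;
            } else if (isControl(b)) {
                control++;
            }
        }
        // Compare count/total against the percentage without floating point.
        return control * 100L <= (long) MAX_CONTROL_PERCENT * sample.length
            && nonAscii * 100L <= (long) MAX_NON_ASCII_PERCENT * sample.length;
    }
}
```

Under these assumptions, a 100-byte sample containing a single Control-K (0x0B)
passes the check, while a 10-byte sample containing one does not.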

> Enhance content-type detector to recognize almost plain text
> ------------------------------------------------------------
>
>                 Key: TIKA-688
>                 URL: https://issues.apache.org/jira/browse/TIKA-688
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 0.9
>            Reporter: Chris Lott
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.10
>
>
> I am using Tika to convert a collection of documents that includes files 
> named something.txt.  I use the Tika#parse(InputStream) method to get a 
> reader that auto-detects the content type.  The files are almost plain 
> text -- the documents have a scattering of control characters in them.  On 
> these text files, the reader given to me by the Tika#parse() method 
> immediately returns null.  After some experimentation I found that a single 
> Control-K character early in the file causes the MIME type detector to give 
> up and label the file application/octet-stream.  Please consider adding a 
> recognizer for such files; it would be great if Tika could clean them up by 
> dropping the non-text characters.  I note that if I drop this file into the 
> Tika GUI, or if I invoke Tika on the command line, it does well, and I think 
> this behavior is obtained by using the file name as a hint.  I probably 
> should be using a different Tika method; I will try to figure that out next.  
> Thanks for listening.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
