[ https://issues.apache.org/jira/browse/TIKA-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-688.
--------------------------------
Resolution: Fixed
Assignee: Jukka Zitting
Implemented in revision 1171952: the content-type detector now accepts content
as plain text if it contains up to 2% control characters and up to 10%
non-ASCII characters.
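To make the two thresholds concrete, here is a minimal sketch of a byte-counting check of this kind. It is not the code from revision 1171952. The class name `AlmostPlainText`, the method `looksLikeText`, the choice of which bytes count as control characters (everything below 0x20 except tab, LF and CR, plus DEL), and the handling of empty input are all assumptions made for illustration.

```java
/** Hypothetical helper, not part of Tika: a byte-counting "almost plain text" check. */
final class AlmostPlainText {

    private AlmostPlainText() {}

    /**
     * Returns true if at most 2% of the sampled bytes are control characters
     * and at most 10% are non-ASCII. Empty input is reported as "not text",
     * which is an assumption of this sketch.
     */
    static boolean looksLikeText(byte[] sample) {
        if (sample.length == 0) {
            return false;
        }
        int control = 0;
        int nonAscii = 0;
        for (byte b : sample) {
            int c = b & 0xFF; // treat the byte as unsigned
            if (c >= 0x80) {
                nonAscii++;
            } else if (c == 0x7F || (c < 0x20 && c != '\t' && c != '\n' && c != '\r')) {
                // Assumed definition of "control": C0 controls other than
                // tab, line feed and carriage return, plus DEL.
                control++;
            }
        }
        // Integer arithmetic avoids floating-point rounding at the boundaries.
        return control * 100L <= sample.length * 2L
                && nonAscii * 100L <= sample.length * 10L;
    }
}
```

Because the limits are ratios, the result depends on how many bytes are sampled: one stray control character in a very short sample can still exceed 2%.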
> Enhance content-type detector to recognize almost plain text
> ------------------------------------------------------------
>
> Key: TIKA-688
> URL: https://issues.apache.org/jira/browse/TIKA-688
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Affects Versions: 0.9
> Reporter: Chris Lott
> Assignee: Jukka Zitting
> Priority: Minor
> Fix For: 0.10
>
>
> I am using Tika to convert a collection of documents that includes files
> named something.txt. I use the Tika#parse(InputStream) interface to get a
> reader that auto-detects the content type. The files are almost plain text:
> the documents have a scattering of control characters in them. On these text
> files the reader given to me by the Tika#parse() method immediately returns
> null. After some experimentation I found that a single Control-K character
> (0x0B) early in the file causes the MIME type detector to give up and label
> the file application/octet-stream. Please consider adding a recognizer for
> such files; it would be great if Tika could clean them up by dropping the
> control characters. I note that Tika handles this file well when I drop it
> into the Tika GUI or invoke Tika on the command line, and I think that is
> because the file name is used as a hint. I should probably be using a
> different Tika method; I am trying to figure that out next. Thanks for
> listening.
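The cleanup the reporter asks for can also be done on the caller's side once the text has been read. The sketch below uses only the Java standard library and does not describe anything Tika does. The class `ControlCharFilter` and the method `dropControlChars` are hypothetical names, and keeping tab, LF and CR is an assumption.

```java
/** Hypothetical helper, not part of Tika: removes stray control characters. */
final class ControlCharFilter {

    private ControlCharFilter() {}

    /** Returns the text with ISO control characters removed, keeping tab, LF and CR. */
    static String dropControlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Character.isISOControl covers U+0000..U+001F and U+007F..U+009F.
            boolean keep = c == '\t' || c == '\n' || c == '\r'
                    || !Character.isISOControl(c);
            if (keep) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Applied to a line containing a Control-K character, this returns the same line with that character removed and the line break left in place.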