Enhance content-type detector to recognize almost plain text
------------------------------------------------------------
Key: TIKA-688
URL: https://issues.apache.org/jira/browse/TIKA-688
Project: Tika
Issue Type: Improvement
Components: mime
Affects Versions: 0.9
Reporter: Chris Lott
Priority: Minor
Fix For: 1.0
I am using Tika to convert a collection of documents that includes files named
something.txt. I use the Tika#parse(InputStream) method to get a reader that
auto-detects the content type. The files are almost plain text: the documents
have a scattering of control characters in them. On these text files the
reader given to me by the Tika#parse() method immediately returns null. After
some experimentation I found that a single Control-K character (0x0B, vertical
tab) early in the file causes the MIME type detector to give up and label the
file application/octet-stream. Please consider adding a recognizer for such
files; it would be great if Tika could clean them up by dropping the control
characters. I note that if I drop this file into the Tika GUI, or if I invoke
Tika on the command line, it does well, and I think this behavior is obtained
by using the file name as a hint. I probably should be using a different Tika
method and will try to figure that out next. Thanks for listening.
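
To make the request concrete, here is a rough sketch of the kind of recognizer
I have in mind. It is not Tika code and does not use any Tika API: the class
name AlmostPlainText, the 5% tolerance, and the 8 KB sample size are all
assumptions chosen for illustration. It treats a file as text when only a small
share of the sampled bytes are stray control characters, and offers a helper
that drops those characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: tolerant plain-text heuristic.
final class AlmostPlainText {
    // Assumed tolerance: at most 5% stray control bytes in the sample.
    static final double MAX_CONTROL_RATIO = 0.05;
    // Assumed sample size: only the first 8 KB are inspected.
    static final int SAMPLE_SIZE = 8192;

    // True for C0 control codes and DEL, except tab, line feed, and
    // carriage return, which are ordinary whitespace in text files.
    static boolean isStrayControl(int c) {
        if (c == '\t' || c == '\n' || c == '\r') {
            return false;
        }
        return c < 0x20 || c == 0x7F;
    }

    // Returns true when the sampled prefix looks like text with only a
    // scattering of control characters.
    static boolean looksLikeText(byte[] data) {
        int n = Math.min(data.length, SAMPLE_SIZE);
        if (n == 0) {
            return false; // no evidence either way; stay conservative
        }
        int control = 0;
        for (int i = 0; i < n; i++) {
            int b = data[i] & 0xFF;
            if (b == 0) {
                return false; // a NUL byte strongly suggests binary data
            }
            if (isStrayControl(b)) {
                control++;
            }
        }
        return control <= n * MAX_CONTROL_RATIO;
    }

    // Removes stray control characters and keeps everything else.
    static String dropControlCharacters(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!isStrayControl(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A line of text with one Control-K (0x0B) near the start.
        String sample = "Chapter 1\u000BThe quick brown fox jumps over the lazy dog.\n";
        byte[] bytes = sample.getBytes(StandardCharsets.UTF_8);
        System.out.println("looks like text: " + looksLikeText(bytes));
        System.out.print(dropControlCharacters(sample));
    }
}
```

With a check like this, a file containing a single Control-K would still be
classified as text, and the stray character could be removed before the content
is handed on. A real detector would also need to consider character encodings,
which this sketch ignores.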