Enhance content-type detector to recognize almost plain text
------------------------------------------------------------

                 Key: TIKA-688
                 URL: https://issues.apache.org/jira/browse/TIKA-688
             Project: Tika
          Issue Type: Improvement
          Components: mime
    Affects Versions: 0.9
            Reporter: Chris Lott
            Priority: Minor
             Fix For: 1.0


I am using Tika to convert a collection of documents that includes files named 
something.txt.  I use the Tika#parse(InputStream) method, which auto-detects 
the content type and returns a reader.  The files are almost plain text: the 
documents have a scattering of control characters in them.  On these text 
files the reader given to me by Tika#parse() immediately returns null.  After 
some experimentation I found that a single Control-K character (0x0B, vertical 
tab) early in the file causes the MIME type detector to give up and label the 
file application/octet-stream.  Please consider adding a recognizer for such 
files; it would be great if Tika could clean them up by dropping the control 
characters.  I note that if I drop such a file into the Tika GUI, or invoke 
Tika on the command line, it does well, and I think this behavior is obtained 
by using the file name as a hint.  I should probably be using a different Tika 
method; I am trying to figure that out next.  Thanks for listening.
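
To make the request concrete, here is a minimal sketch of the kind of 
recognizer I have in mind, in plain Java and independent of Tika's actual 
detector.  The class name AlmostPlainText, its method names, and the tolerance 
parameter are all hypothetical and only illustrate the idea: tolerate a small 
share of stray control bytes in the file prefix, then drop them from the text.

```java
final class AlmostPlainText {

    // True for ASCII control codes other than tab, line feed and
    // carriage return (so Control-K, 0x0B, counts as stray).
    static boolean isStrayControl(int c) {
        if (c == '\t' || c == '\n' || c == '\r') {
            return false;
        }
        return c < 0x20 || c == 0x7F;
    }

    // Hypothetical heuristic: treat the prefix as text when the share of
    // stray control bytes does not exceed maxControlRatio (e.g. 0.1).
    static boolean looksLikeText(byte[] prefix, double maxControlRatio) {
        if (prefix.length == 0) {
            return true;
        }
        int stray = 0;
        for (byte b : prefix) {
            if (isStrayControl(b & 0xFF)) {
                stray++;
            }
        }
        return stray <= maxControlRatio * prefix.length;
    }

    // Returns the text with stray control characters removed.
    static String stripControls(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!isStrayControl(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With a tolerance of 0.1, an 11-byte prefix containing one Control-K is still 
accepted as text, while a prefix made up entirely of control bytes is not.  
The right tolerance is a judgment call and would need tuning on real files.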

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
