[ 
https://issues.apache.org/jira/browse/TIKA-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081846#comment-13081846
 ] 

Chris Lott commented on TIKA-688:
---------------------------------

Forgot to mention this suggestion.  I found the code in MimeTypes.java, at 
about line 260, that checks for control characters in the first block.  
Perhaps this check could be weighted with some rule to allow the occasional 
control character.  For example, if fewer than 1% of the input bytes in the 
first block read are control characters, it might be OK to parse the input as 
plain text.  Anyhow, I switched to using Tika#parse(InputStream, Metadata) 
with better results.
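
A minimal sketch of the weighting rule suggested above, not the actual code in 
MimeTypes.java.  The class name AlmostPlainText, the method names, and the 
exact definition of a "control character" (any byte below 0x20 other than tab, 
line feed, form feed and carriage return, plus 0x7F) are assumptions made for 
illustration only.  The 1% threshold is the figure from the comment.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Tika.
final class AlmostPlainText {

    // Assumed threshold: fewer than 1 control byte per 100 bytes is tolerated.
    static final int MAX_CONTROL_PERCENT = 1;

    // Assumed definition of a control byte. Tab, LF, FF and CR are treated as
    // ordinary text; Control-K (0x0B) counts as a control byte.
    static boolean isControl(int b) {
        if (b == '\t' || b == '\n' || b == '\f' || b == '\r') {
            return false;
        }
        return b < 0x20 || b == 0x7F;
    }

    // Returns true if strictly fewer than MAX_CONTROL_PERCENT percent of the
    // first `length` bytes are control bytes. An empty block is not
    // classified as text in this sketch.
    static boolean looksLikeText(byte[] block, int length) {
        int controls = 0;
        for (int i = 0; i < length; i++) {
            if (isControl(block[i] & 0xFF)) {
                controls++;
            }
        }
        // Integer arithmetic avoids floating-point rounding at the boundary.
        return (long) controls * 100 < (long) length * MAX_CONTROL_PERCENT;
    }

    public static void main(String[] args) {
        // Build a sample of a few hundred bytes containing one Control-K.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            sb.append("a line of ordinary text\n");
        }
        sb.insert(10, '\u000B');
        byte[] sample = sb.toString().getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeText(sample, sample.length));
    }
}
```

With this rule a single stray Control-K in a block of a few hundred bytes no 
longer forces a binary classification, while a block made up mostly of 
control bytes is still rejected.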

> Enhance content-type detector to recognize almost plain text
> ------------------------------------------------------------
>
>                 Key: TIKA-688
>                 URL: https://issues.apache.org/jira/browse/TIKA-688
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 0.9
>            Reporter: Chris Lott
>            Priority: Minor
>             Fix For: 1.0
>
>
> I am using Tika to convert a collection of documents that includes files 
> named something.txt.  I use the Tika#parse(InputStream) interface to get a 
> reader that auto-detects content.  The files are almost plain text -- the 
> documents have a scattering of control characters in them.  On these text 
> files the reader given to me by the Tika#parse() method immediately returns 
> null.  After some experimentation I found that a single Control-K character 
> early in the file will cause the mime type detector to give up and label it 
> application/octet-stream.  Please consider adding a recognizer for such 
> files; it would be great if Tika could clean them up by dropping the control 
> characters.  I note that if I drop this file into the Tika GUI, or if I 
> invoke Tika on the command line, it does well, and I think this behavior is 
> obtained by using the file name as a hint.  I probably should be using a 
> different Tika method; I am trying to figure that out next.  Thanks for 
> listening.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
