Luca created TIKA-3822:
--------------------------

             Summary: Plain text file reported as application/octet-stream
                 Key: TIKA-3822
                 URL: https://issues.apache.org/jira/browse/TIKA-3822
             Project: Tika
          Issue Type: Improvement
    Affects Versions: 1.28
            Reporter: Luca
         Attachments: plaintextfile.txt

I need my application to detect short files as "text/plain" even when they 
contain some control characters (SOH, STX, ETX, VT ...).

Depending on the total length of the file, the percentage of control 
characters can exceed 2%, causing the isMostlyAscii method to return "false" 
(an example is attached).
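
To illustrate why file length matters, here is a minimal standalone sketch of 
a 2% control-character check. It is not Tika's code: the class and method 
names (ControlCharRatio, countUnsafeControl, belowControlThreshold) are 
hypothetical, and the set of control bytes treated as harmless (tab, LF, FF, 
CR, ESC) is an assumption. The same three control characters fail the check 
in a 19-byte file but pass it in a 209-byte file.

```java
import java.nio.charset.StandardCharsets;

public class ControlCharRatio {

    // Assumption: these control bytes are considered harmless in plain text.
    public static boolean isSafeControl(int b) {
        return b == '\t' || b == '\n' || b == '\r' || b == 0x0C || b == 0x1B;
    }

    // Counts bytes below 0x20 that are not in the "safe" set above
    // (for example SOH, STX, ETX, VT).
    public static int countUnsafeControl(byte[] data) {
        int count = 0;
        for (byte raw : data) {
            int b = raw & 0xFF;
            if (b < 0x20 && !isSafeControl(b)) {
                count++;
            }
        }
        return count;
    }

    // True when unsafe control bytes make up less than 2% of the data.
    public static boolean belowControlThreshold(byte[] data) {
        return data.length > 0
                && countUnsafeControl(data) * 100 < data.length * 2;
    }

    public static void main(String[] args) {
        // 19 bytes, 3 of them control characters (about 16%).
        byte[] shortFile = "\u0001HEADER\u0002short body\u0003"
                .getBytes(StandardCharsets.US_ASCII);

        // Same 3 control characters, padded to 209 bytes (about 1.4%).
        StringBuilder sb = new StringBuilder("\u0001HEADER\u0002");
        for (int i = 0; i < 200; i++) {
            sb.append('a');
        }
        sb.append('\u0003');
        byte[] longFile = sb.toString().getBytes(StandardCharsets.US_ASCII);

        System.out.println("short file passes: " + belowControlThreshold(shortFile));
        System.out.println("long file passes:  " + belowControlThreshold(longFile));
    }
}
```

With a fixed percentage threshold, a handful of framing characters is enough 
to push a very short file over the limit, which matches the behaviour 
described above.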

Is there any suggestion to avoid this issue?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
