[ https://issues.apache.org/jira/browse/TIKA-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoni Mylka updated TIKA-814:
------------------------------
Attachment: tika-textdetector.patch
A patch that makes the text detector examine the entire array supplied by
MimeTypes.
> Increase the amount of bytes read by TextDetector
> -------------------------------------------------
>
> Key: TIKA-814
> URL: https://issues.apache.org/jira/browse/TIKA-814
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 1.1
> Reporter: Antoni Mylka
> Attachments: tika-textdetector.patch
>
>
> In TIKA-688 Jukka implemented a plain text detector. It is invoked
> automatically inside MimeTypes. I have found a number of files in my
> collections that are binary but are still detected as plain text. They
> wouldn't be if the plain text detector were allowed to look at more than the
> initial 512 bytes. I think that the TextDetector should look at
> MimeTypes.getMinLength() bytes. It is given a ByteArrayInputStream backed by
> an array, and it should read all the bytes in that array.
> The performance impact should be negligible (no I/O, no allocations, just
> pure array lookups), while my experiments show that there are cases where 512
> bytes are not enough.
> If anyone objects for performance reasons, I'll create another patch that
> allows users to decouple the TextDetector from MimeTypes and supply their
> own, with different settings.
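To illustrate the effect described above, here is a minimal sketch in Java. It
is not the code from the attached patch and not Tika's TextDetector. The class
name TextSniffer, the method looksLikeText, and the control-byte heuristic are
all hypothetical, chosen only to show how a binary file can pass a 512-byte
check and fail once the whole array is examined.

```java
import java.io.ByteArrayInputStream;
import java.util.Arrays;

// Hypothetical illustration only; not Tika's TextDetector and not the patch.
final class TextSniffer {
    private TextSniffer() {}

    // Assumed heuristic: a byte below 0x20 counts as "binary" unless it is
    // tab, line feed, form feed, carriage return, or escape.
    static boolean isSuspiciousControl(int b) {
        return b < 0x20 && b != '\t' && b != '\n' && b != '\f' && b != '\r' && b != 0x1B;
    }

    // Examines at most maxBytes bytes and then rewinds the stream, so the
    // caller can hand the same stream to another detector afterwards.
    static boolean looksLikeText(ByteArrayInputStream in, int maxBytes) {
        in.mark(maxBytes); // the read limit is ignored by ByteArrayInputStream
        try {
            for (int i = 0; i < maxBytes; i++) {
                int b = in.read();
                if (b == -1) {
                    break; // end of the backing array
                }
                if (isSuspiciousControl(b)) {
                    return false;
                }
            }
            return true;
        } finally {
            in.reset(); // pure in-memory rewind, no I/O
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[1024];
        Arrays.fill(data, (byte) 'a');
        data[600] = 0; // a NUL byte located past the first 512 bytes

        // Looking at only 512 bytes misses the NUL byte.
        System.out.println(looksLikeText(new ByteArrayInputStream(data), 512));         // true
        // Looking at the whole array finds it.
        System.out.println(looksLikeText(new ByteArrayInputStream(data), data.length)); // false
    }
}
```

Because the stream is a ByteArrayInputStream, raising the limit costs only
extra array reads, which matches the performance argument in the description.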