[
https://issues.apache.org/jira/browse/TIKA-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka closed TIKA-814.
-----------------------------
Resolution: Fixed
Fix Version/s: 1.1
Committed in r1220698.
This change theoretically affects every Tika user who invokes MimeTypes.
I believe the performance overhead is negligible, and it yields better
results on 5 broken BMP files I have in my collections.
If you disagree, revert the change and reopen this issue. I'll then create a
second solution, with customizable plain text detection.
For now, I am closing this.
> Increase the amount of bytes read by TextDetector
> -------------------------------------------------
>
> Key: TIKA-814
> URL: https://issues.apache.org/jira/browse/TIKA-814
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 1.1
> Reporter: Antoni Mylka
> Fix For: 1.1
>
> Attachments: tika-textdetector.patch
>
>
> In TIKA-688 Jukka implemented a plain text detector. It is invoked
> automatically inside MimeTypes. I have found a number of files in my
> collections that are binary but are still detected as plain text. They
> wouldn't be if the plain text detector were allowed to look at more than the
> initial 512 bytes. I think that the TextDetector should look at
> MimeTypes.getMinLength bytes. It is given a ByteArrayInputStream backed by an
> array, and it should read all bytes in that array.
> The performance impact should be negligible (no I/O, no allocations, just
> pure array lookups), while my experiments show that there are cases when 512
> bytes is not enough.
> If anyone objects for performance reasons, I'll create another patch
> that lets users decouple the TextDetector from MimeTypes and supply their
> own, with different settings.
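
To illustrate the point made in the description, here is a minimal sketch of how a fixed look-ahead limit can misclassify a binary file. It is not Tika's TextDetector: the class name TextSniffSketch, the method looksLikeText and the rule "any control byte other than tab, LF, CR, FF or ESC means binary" are assumptions made up for this example. Only the 512-byte limit and the ByteArrayInputStream input come from the issue text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical illustration only; this is not the Tika implementation.
final class TextSniffSketch {

    /**
     * Reads at most {@code limit} bytes from the stream and reports whether
     * all of them look like plain text. The heuristic is an assumption:
     * control bytes below 0x20 count as binary, except tab, line feed,
     * carriage return, form feed and escape.
     */
    public static boolean looksLikeText(InputStream in, int limit) {
        byte[] buf = new byte[limit];
        int n = 0;
        try {
            // Keep reading until the limit is reached or the stream ends.
            while (n < limit) {
                int r = in.read(buf, n, limit - n);
                if (r == -1) {
                    break;
                }
                n += r;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (int i = 0; i < n; i++) {
            int b = buf[i] & 0xFF;
            boolean allowed = b == '\t' || b == '\n' || b == '\r'
                    || b == 0x0C || b == 0x1B;
            if (b < 0x20 && !allowed) {
                return false; // control byte found: treat as binary
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 1024 bytes of 'a' with a single NUL byte beyond the 512-byte mark.
        byte[] data = new byte[1024];
        Arrays.fill(data, (byte) 'a');
        data[600] = 0x00;

        // Looking at the first 512 bytes only misses the NUL byte.
        System.out.println(looksLikeText(new ByteArrayInputStream(data), 512));
        // Looking at the whole backing array finds it.
        System.out.println(looksLikeText(new ByteArrayInputStream(data), data.length));
    }
}
```

With this sample buffer the first call prints true and the second prints false. The extra work is a single pass over bytes that are already in memory, which matches the argument above that the cost should be small.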