Petr Pytelka created TIKA-1116:
----------------------------------
Summary: Wrong detection of XLS/Doc file
Key: TIKA-1116
URL: https://issues.apache.org/jira/browse/TIKA-1116
Project: Tika
Issue Type: Bug
Components: mime
Affects Versions: 1.3, 1.4
Reporter: Petr Pytelka
My issue:
I have a valid XLS file, but it is detected as DOC.
Cause:
tika-mimetypes.xml contains these lines:
<mime-type type="application/msword">
..
<match value="\320\317\021\340\241\261\032\341" type="string" offset="0"/>
..
</mime-type>
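To make the match value easier to read, here is a minimal sketch that decodes it into hex bytes. It assumes the value is a sequence of three-digit backslash-octal escapes; the class and method names (OctalMagic, decode, toHex) are hypothetical and are not part of Tika.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not Tika code: decodes "\ddd" octal escapes into bytes.
final class OctalMagic {
    private OctalMagic() {}

    // Assumes every backslash is followed by exactly three octal digits;
    // any other character is copied through as its low byte.
    static byte[] decode(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            char c = escaped.charAt(i);
            if (c == '\\' && i + 3 < escaped.length()) {
                // Parse the three digits after the backslash as base 8.
                out.write(Integer.parseInt(escaped.substring(i + 1, i + 4), 8));
                i += 4;
            } else {
                out.write(c);
                i++;
            }
        }
        return out.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }
}
```

Calling OctalMagic.toHex(OctalMagic.decode(...)) on the match value above yields D0 CF 11 E0 A1 B1 1A E1.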
According to the MS documentation, this 8-byte prefix is the header signature
of every Compound Binary file (DOC, XLS, PPT and others), so it cannot
identify a DOC file specifically.
The documentation is available at:
http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/WindowsCompoundBinaryFileFormatSpecification.pdf
(see section 2.1, Header)
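To check this on real files, here is a minimal sketch that tests whether a file starts with the 8-byte signature. It uses only the JDK; the class name CompoundFileCheck and its methods are hypothetical and not part of Tika. A DOC, an XLS and a PPT in the binary format should all return true, which is why the prefix alone cannot tell them apart.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not Tika code: checks for the compound file header signature.
final class CompoundFileCheck {
    // The 8 bytes written as \320\317\021\340\241\261\032\341 in octal.
    static final byte[] SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    private CompoundFileCheck() {}

    // True if the given bytes begin with the signature.
    static boolean hasSignature(byte[] header) {
        return header.length >= SIGNATURE.length
            && Arrays.equals(Arrays.copyOf(header, SIGNATURE.length), SIGNATURE);
    }

    // Reads the first 8 bytes of the file and compares them with the signature.
    static boolean hasSignature(Path file) throws IOException {
        byte[] header = new byte[SIGNATURE.length];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                total += n;
            }
        }
        return total == header.length && hasSignature(header);
    }
}
```

For example, CompoundFileCheck.hasSignature(java.nio.file.Paths.get("sample.xls")) could be run against both an XLS and a DOC file (sample.xls is a placeholder name).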
My proposal is to remove the line
<match value="\320\317\021\340\241\261\032\341" type="string" offset="0"/>
from tika-mimetypes.xml.