[ https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoni Mylka resolved TIKA-806.
-------------------------------
Resolution: Not A Problem
Fix Version/s: 1.1
Assignee: Antoni Mylka
You're right. No further comments. I guess I can just make use of my
newly found JIRA authority and close this issue as "Not a Problem". Then I'll
add the hack to the app. If in doubt, reopen.
> MS Word Detection magics are a bit overzealous
> ----------------------------------------------
>
> Key: TIKA-806
> URL: https://issues.apache.org/jira/browse/TIKA-806
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 1.1
> Reporter: Antoni Mylka
> Assignee: Antoni Mylka
> Fix For: 1.1
>
> Attachments: tika-806-ver2.patch, tika-806-ver3.zip
>
>
> tika-mimetypes.xml contains the following magic for MS Word:
> {noformat}
> <match value="0xd0cf11e0a1b11ae1" type="string" offset="0:8">
> <match value="W\x00o\x00r\x00d\x00D\x00o\x00c\x00u\x00m\x00e\x00n\x00t"
> type="string" offset="1152:4096" />
> </match>
> {noformat}
> So if a file is an MS Office document (parent Office magic) and has the
> WordDocument string within the given offsets, then it's Word. I have a few
> (regrettably confidential) counterexamples of MS Excel files with embedded
> Word documents. For instance one has "Workbook" (with 0x00 between
> characters) at offset 0x0480 and "WordDocument" (0x00's between characters)
> at offset 0x0B80. This is an Excel file, which does meet the above-mentioned
> magic criterion. Returning x-tika-msoffice would dispatch the file to the POI
> detector, which would return the correct answer.
> I vote for removing that magic. I took a look at some of my files and it
> seems that "WordDocument" and "Workbook" strings do occur at various offsets.
> The presence of embedded documents makes detection by those strings
> unreliable.
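
The counterexample described in the report can be reproduced with a small,
self-contained Java sketch. It does not call Tika; the class Tika806Sketch,
its methods, and the synthetic buffer are hypothetical names made up for
illustration. It assumes that an offset range such as 1152:4096 means the
pattern may start at any offset from 1152 to 4096 inclusive. The buffer is not
a valid OLE2 file; it only places the strings at the offsets quoted in the
report.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Tika806Sketch {
    // OLE2 signature taken from the parent magic (0xd0cf11e0a1b11ae1).
    static final byte[] OLE2 = {
        (byte) 0xd0, (byte) 0xcf, 0x11, (byte) 0xe0,
        (byte) 0xa1, (byte) 0xb1, 0x1a, (byte) 0xe1
    };

    static byte[] utf16le(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE);
    }

    // True if pattern starts at some offset in [from, to] (assumed inclusive).
    static boolean foundInRange(byte[] data, byte[] pattern, int from, int to) {
        for (int start = from; start <= to && start + pattern.length <= data.length; start++) {
            byte[] window = Arrays.copyOfRange(data, start, start + pattern.length);
            if (Arrays.equals(window, pattern)) {
                return true;
            }
        }
        return false;
    }

    // Naive re-implementation of the nested magic quoted in the report.
    static boolean matchesWordMagic(byte[] data) {
        byte[] word = utf16le("WordDocument");
        // The magic value ends at 't', without the trailing 0x00 byte.
        byte[] pattern = Arrays.copyOf(word, word.length - 1);
        return foundInRange(data, OLE2, 0, 8)
            && foundInRange(data, pattern, 1152, 4096);
    }

    // Synthetic stand-in for the confidential Excel file: "Workbook" at
    // 0x0480 and "WordDocument" at 0x0B80, both with 0x00 between characters.
    static byte[] syntheticExcelWithEmbeddedWord() {
        byte[] data = new byte[0x1000];
        System.arraycopy(OLE2, 0, data, 0, OLE2.length);
        byte[] workbook = utf16le("Workbook");
        System.arraycopy(workbook, 0, data, 0x0480, workbook.length);
        byte[] wordDocument = utf16le("WordDocument");
        System.arraycopy(wordDocument, 0, data, 0x0B80, wordDocument.length);
        return data;
    }

    public static void main(String[] args) {
        // The Excel-like buffer satisfies the Word magic.
        System.out.println(matchesWordMagic(syntheticExcelWithEmbeddedWord()));
    }
}
```

Under these assumptions the Excel-like buffer satisfies the Word magic, which
is the false positive the report describes.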