[
https://issues.apache.org/jira/browse/TIKA-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-791:
------------------------------
Attachment: tika-791-ver2.zip
Attached an updated patch which uses a new media type
"application/x-tika-ooxml-protected" for protected OOXML files with the OLE2
magic. This allowed me to implement my own detector, which applies the
following rule:
{noformat}
if mimeTypes says ms-office
and poi says ooxml-protected
and name implies an ooxml subtype
then return the type implied by name
{noformat}
Right now this can't be done with any of the built-in Tika Detectors. If you
think it would be a good idea, then perhaps this would warrant a new issue.
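For illustration, the rule above could be sketched roughly as follows. This is a minimal, dependency-free sketch and not the code from the attached patch: the class and method names are hypothetical, the detection results are passed in as plain strings rather than Tika MediaType objects, and the "ms-office" type string and the extension table are assumptions made for the example.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the rule; not the detector from the attached patch.
class NameAwareOoxmlRule {

    // Assumed name of the generic OLE2 type reported by magic-based detection.
    static final String MS_OFFICE = "application/x-tika-msoffice";
    // Media type introduced by the patch for protected OOXML files.
    static final String OOXML_PROTECTED = "application/x-tika-ooxml-protected";

    // Illustrative subset of OOXML subtypes, keyed by lower-case file extension.
    static final Map<String, String> OOXML_BY_EXTENSION = Map.of(
        "docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");

    // Returns the name-implied OOXML subtype only when all three conditions hold;
    // otherwise falls back to the container-based result.
    static String resolve(String magicType, String containerType, String fileName) {
        String byName = typeImpliedByName(fileName);
        if (MS_OFFICE.equals(magicType)
                && OOXML_PROTECTED.equals(containerType)
                && byName != null) {
            return byName;
        }
        return containerType;
    }

    // Looks up the OOXML subtype implied by the file extension, or null if none.
    static String typeImpliedByName(String fileName) {
        if (fileName == null) {
            return null;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return OOXML_BY_EXTENSION.get(extension);
    }

    public static void main(String[] args) {
        // Name implies DOCX, so the DOCX type is returned.
        System.out.println(resolve(MS_OFFICE, OOXML_PROTECTED, "report.docx"));
        // Name implies no OOXML subtype, so the protected type is kept.
        System.out.println(resolve(MS_OFFICE, OOXML_PROTECTED, "report.bin"));
    }
}
```

The name is consulted only after both content-based checks agree, so a misleading file name alone can never change the result.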
> Fix the detection of protected OOXML files
> ------------------------------------------
>
> Key: TIKA-791
> URL: https://issues.apache.org/jira/browse/TIKA-791
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Affects Versions: 1.1
> Environment: Windows 7 64 bit
> Reporter: Antoni Mylka
> Attachments: tika-791-ver2.zip, tika-791.zip
>
>
> TIKA-437 patch allowed Tika to work with OOXML files protected with the
> default VelvetSweatshop password. I feel there is room for improvement.
> # The POIFSContainerDetector lies when it sees such a file. It should be able
> to mark it as x-tika-ooxml
> # The OOXMLParser can't work with such a file. It should:
> ## If it's protected with the default password, it should be decrypted and
> processed normally.
> ## If it's protected with a non-default password, the file should be marked
> as protected and no unexpected exceptions should be thrown.
> Therefore I'd like to add an 'if' to POIFSContainerDetector which returns
> x-tika-ooxml, and some code to OOXMLParser, which would be similar to the
> code currently residing in OfficeParser. After this improvement both the
> OfficeParser and the OOXMLParser will treat such files in the same way.
> When I have that, I can add a hack in my application, which will say "If the
> type is x-tika-ooxml and the name-based detection is a specialization of
> ooxml, then use the name-based detection". This will be a workaround for the
> fact that in MimeTypes, magic always trumps the name. With that, the
> encrypted DOCX files will appear with the normal DOCX mimetype in my app.