[ https://issues.apache.org/jira/browse/TIKA-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967868#comment-15967868 ]
Tim Allison commented on TIKA-2311:
-----------------------------------
When I use a static MediaTypesRegistry in PackageParser, the new unit test
passes on a truncated docx; however, test-documents.tar is now detected as
"x-gtar".
Looking at the definition:
{noformat}
<mime-type type="application/x-gtar">
  <_comment>GNU tar Compressed File Archive (GNU Tape Archive)</_comment>
  <magic priority="50">
    <!-- GNU tar archive -->
    <match value="ustar \0" type="string" offset="257" />
  </magic>
  <glob pattern="*.gtar"/>
  <sub-class-of type="application/x-tar"/>
</mime-type>
{noformat}
Is this really the mime for {{ustar}}, not {{gtar}}?
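For reference, here is a minimal sketch of how the two header variants could be told apart at offset 257. It assumes, as far as I know the tar formats, that POSIX ustar writes "ustar", a NUL, and the version "00", while the old GNU variant writes "ustar" followed by two spaces and a NUL. The class and method names ({{TarMagicSketch}}, {{classify}}, {{headerWith}}) are hypothetical and are not part of Tika.

```java
import java.util.Arrays;

// Hypothetical helper, not Tika code: classifies a tar header block by the
// 8 bytes (magic + version fields) that start at offset 257.
final class TarMagicSketch {
    static final int MAGIC_OFFSET = 257; // same offset as in the <match> above
    static final int MAGIC_LENGTH = 8;   // 6-byte magic + 2-byte version

    // Assumed POSIX ustar layout: "ustar\0" followed by version "00".
    private static final byte[] POSIX = {'u', 's', 't', 'a', 'r', 0, '0', '0'};
    // Assumed old GNU layout: "ustar" + two spaces + NUL.
    private static final byte[] GNU = {'u', 's', 't', 'a', 'r', ' ', ' ', 0};

    static String classify(byte[] header) {
        if (header == null || header.length < MAGIC_OFFSET + MAGIC_LENGTH) {
            return "unknown"; // too short to contain the magic field
        }
        byte[] field = Arrays.copyOfRange(header, MAGIC_OFFSET, MAGIC_OFFSET + MAGIC_LENGTH);
        if (Arrays.equals(field, POSIX)) {
            return "posix-ustar";
        }
        if (Arrays.equals(field, GNU)) {
            return "gnu-tar";
        }
        return "unknown";
    }

    // Builds an otherwise empty 512-byte header block carrying the given magic bytes.
    static byte[] headerWith(byte[] magic) {
        byte[] header = new byte[512];
        System.arraycopy(magic, 0, header, MAGIC_OFFSET, magic.length);
        return header;
    }

    public static void main(String[] args) {
        byte[] gnuHeader = headerWith(new byte[] {'u', 's', 't', 'a', 'r', ' ', ' ', 0});
        System.out.println(classify(gnuHeader));
    }
}
```

If that assumption holds, the only difference between the two magics is whitespace after "ustar", so it is worth checking the exact {{value}} in the mime definition rather than the copy shown in this message.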
> Create x-tika-ooxml-unk mime type (?)
> -------------------------------------
>
> Key: TIKA-2311
> URL: https://issues.apache.org/jira/browse/TIKA-2311
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
>
> The following is an unintended consequence of TIKA-2212.
> The OOXML parser used to handle {{x-tika-ooxml}}. We have some truncated
> ooxml files in our regression corpus. The previous behavior was:
> 1) ZipPackage detector caught the zip truncation exception and returned
> "application/zip"
> 2) The mime detector recognized magic and returned {{x-tika-ooxml}}
> 3) The file was then routed to the OOXML parser which didn't wind up doing
> much with the content because it hit the zip exception early on, but the
> final mime type was {{x-tika-ooxml}}.
> The current behavior is:
> 1) Same detection steps
> 2) However, because the OOXML parser no longer handles {{x-tika-ooxml}}, the
> file is handled by the Package Parser, which overwrites the magic-determined
> mime type, and the new mime type is {{application/zip}}.
> 3) Some content is extracted because the Package parser handles the zip
> entries in order and only throws the exception once it hits the last entry in
> the zip file.
> Ideally, I'd like to keep the magic-determined mime detection. Once we can
> chain parsers, the user should be able to back off to the PackageParser, but I
> don't think this should be the default behavior.
> One solution would be to create a new mime type that is not the parent of the
> other ooxml subtypes, but is itself a leaf subtype, something like:
> {{x-tika-ooxml-unk}}.
> Any objections/other recommendations?
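To make the "keep the magic-determined mime type" idea concrete, here is a rough sketch of one possible rule: a parser only overwrites the detected type when the detected type is not already a specialization of the type the parser would set. This is not how PackageParser is implemented and it is not the {{x-tika-ooxml-unk}} proposal itself; the class {{MimeOverrideSketch}}, its methods, and the small parent table are hypothetical and use plain strings instead of Tika's own types.

```java
import java.util.Map;

// Hypothetical illustration, not Tika code.
final class MimeOverrideSketch {
    // Illustrative child -> parent relationships (an assumption for this sketch).
    private static final Map<String, String> PARENT = Map.of(
            "application/x-tika-ooxml", "application/zip",
            "application/zip", "application/octet-stream");

    // True if 'child' has 'ancestor' somewhere above it in the parent chain.
    static boolean isSpecializationOf(String child, String ancestor) {
        for (String t = PARENT.get(child); t != null; t = PARENT.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    // Keeps the detected type when it is more specific than the parser's type;
    // otherwise falls back to the type the parser would report.
    static String resolve(String detected, String parserType) {
        if (detected != null && isSpecializationOf(detected, parserType)) {
            return detected;
        }
        return parserType;
    }

    public static void main(String[] args) {
        // A truncated ooxml file: magic says x-tika-ooxml, the package parser says zip.
        System.out.println(resolve("application/x-tika-ooxml", "application/zip"));
    }
}
```

Under this rule the truncated ooxml case keeps its magic-determined type, while a plain zip is still reported as {{application/zip}}.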