Tim Allison created TIKA-2311:
---------------------------------

             Summary: Create tika-ooxml-unk mime type
                 Key: TIKA-2311
                 URL: https://issues.apache.org/jira/browse/TIKA-2311
             Project: Tika
          Issue Type: Bug
            Reporter: Tim Allison


The following is an unintended consequence of TIKA-2212.

The OOXML parser used to handle {{x-tika-ooxml}}. We have some truncated OOXML 
files in our regression corpus. The previous behavior was:

1) The ZipPackage detector caught the zip truncation exception and returned 
"application/zip"
2) The mime detector recognized the magic bytes and returned {{x-tika-ooxml}}
3) The file was then routed to the OOXML parser, which extracted little or no 
content because it hit the zip exception early on, but the final mime type was 
{{x-tika-ooxml}}.

The current behavior is:
1) Same detection steps
2) However, because the OOXML parser no longer handles {{x-tika-ooxml}}, the 
file is handled by the Package Parser, which overwrites the magic-determined 
mime type, and the new mime type is {{application/zip}}.
3) Some content is extracted because the Package parser handles the zip entries 
in order and only throws the exception once it hits the last entry in the zip 
file.
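
To illustrate step 3, here is a minimal sketch of why a streaming reader can 
still pull content out of a truncated zip. It uses the JDK's 
{{java.util.zip.ZipInputStream}} as a stand-in, not the Package parser's actual 
code path, and the class and method names ({{TruncatedZipDemo}}, 
{{buildTruncatedZip}}, {{readUntilFailure}}) are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; a JDK-only stand-in, not Tika's PackageParser.
class TruncatedZipDemo {

    /** Entries that were read to completion, and whether an I/O error ended the read. */
    static final class Result {
        final List<String> completedEntries = new ArrayList<>();
        boolean failed;
    }

    /** Builds a two-entry zip in memory and cuts it off midway through the second entry. */
    static byte[] buildTruncatedZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int secondEntryStart;
        int secondEntryEnd;
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("first.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            secondEntryStart = bytes.size();

            // Pseudo-random bytes barely compress, so the midpoint below
            // falls inside the entry's data rather than its header.
            byte[] payload = new byte[20_000];
            new Random(42).nextBytes(payload);
            zip.putNextEntry(new ZipEntry("second.bin"));
            zip.write(payload);
            zip.closeEntry();
            secondEntryEnd = bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        int cut = secondEntryStart + (secondEntryEnd - secondEntryStart) / 2;
        return Arrays.copyOf(bytes.toByteArray(), cut);
    }

    /** Reads entries in stream order and stops at the first I/O error. */
    static Result readUntilFailure(byte[] data) {
        Result result = new Result();
        byte[] buffer = new byte[4096];
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                while (zip.read(buffer) != -1) {
                    // Drain the entry; a real parser would extract content here.
                }
                result.completedEntries.add(entry.getName());
            }
        } catch (IOException e) {
            // The truncation only surfaces once the damaged entry is reached.
            result.failed = true;
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = readUntilFailure(buildTruncatedZip());
        System.out.println("completed=" + r.completedEntries + " failed=" + r.failed);
    }
}
```

In this sketch the first entry is read in full before the error is raised on 
the truncated second entry, which mirrors the partial extraction described 
above.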

Ideally, I'd like to keep the magic-determined mime type. Once we can chain 
parsers, the user should be able to back off to the PackageParser, but I don't 
think this should be the default behavior.

One solution would be to create a new mime type that is not the parent of the 
other ooxml subtypes, but is itself a leaf subtype, something like: 
{{x-tika-ooxml-unk}}.
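
A toy model of the routing may make the proposal easier to discuss. It assumes 
parser selection walks from the detected type up through its parent types 
until some parser claims one, and it is not Tika's actual code: 
{{ParserRoutingSketch}}, its maps and the parser names as plain strings are all 
illustrative. It also assumes the new type would be a child of 
{{x-tika-ooxml}} and that detection would report it for such files, neither of 
which is settled here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of mime-type-based parser routing; not Tika code.
class ParserRoutingSketch {
    static final String ZIP = "application/zip";
    static final String OOXML = "application/x-tika-ooxml";
    static final String DOCX =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    // Proposed leaf type; its place in the hierarchy is an assumption.
    static final String OOXML_UNK = "application/x-tika-ooxml-unk";

    final Map<String, String> parentOf = new HashMap<>();   // child type -> parent type
    final Map<String, String> parserFor = new HashMap<>();  // type -> parser name

    /** Returns the first parser found while walking up the type hierarchy. */
    String route(String type) {
        for (String t = type; t != null; t = parentOf.get(t)) {
            String parser = parserFor.get(t);
            if (parser != null) {
                return parser;
            }
        }
        return "none";
    }

    /** Current situation: no parser claims x-tika-ooxml itself. */
    static ParserRoutingSketch current() {
        ParserRoutingSketch s = new ParserRoutingSketch();
        s.parentOf.put(OOXML, ZIP);
        s.parentOf.put(DOCX, OOXML);
        s.parserFor.put(ZIP, "PackageParser");
        s.parserFor.put(DOCX, "OOXMLParser");
        return s;
    }

    /** Proposal: add a leaf type that the OOXML parser claims directly. */
    static ParserRoutingSketch proposed() {
        ParserRoutingSketch s = current();
        s.parentOf.put(OOXML_UNK, OOXML);
        s.parserFor.put(OOXML_UNK, "OOXMLParser");
        return s;
    }

    public static void main(String[] args) {
        System.out.println(current().route(OOXML));
        System.out.println(proposed().route(OOXML_UNK));
    }
}
```

Under these assumptions, {{x-tika-ooxml}} falls through to the zip-level parser 
today, while a file typed as the new leaf would go to the OOXML parser without 
making that parser claim the parent of the other OOXML subtypes.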

Any objections/other recommendations?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
