[jira] [Commented] (TIKA-2311) Preserve "x-tika-ooxml" mime value for truncated ooxml files

Hudson (JIRA) Thu, 13 Apr 2017 12:04:01 -0700

    [ https://issues.apache.org/jira/browse/TIKA-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968089#comment-15968089 ]


Hudson commented on TIKA-2311:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1239 (See 
[https://builds.apache.org/job/Tika-trunk/1239/])
TIKA-2311 -- maintain mime information for truncated ooxml (tallison: 
[https://github.com/apache/tika/commit/3aab15f8f277614e3c5783c4862e25d63b737425])
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (add) tika-parsers/src/test/resources/test-documents/testWORD_truncated.docx
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/pkg/TarParserTest.java


> Preserve "x-tika-ooxml" mime value for truncated ooxml files
> ------------------------------------------------------------
>
>                 Key: TIKA-2311
>                 URL: https://issues.apache.org/jira/browse/TIKA-2311
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>             Fix For: 2.0, 1.15
>
>
> The following is an unintended consequence of TIKA-2212.
> The OOXML parser used to handle {{x-tika-ooxml}}. We have some truncated 
> ooxml files in our regression corpus.  The previous behavior was:
> 1) ZipPackage detector caught the zip truncation exception and returned 
> "application/zip"
> 2) The mime detector recognized magic and returned {{x-tika-ooxml}}
> 3) The file was then routed to the OOXML parser which didn't wind up doing 
> much with the content because it hit the zip exception early on, but the 
> final mime type was {{x-tika-ooxml}}.
> The current behavior is:
> 1) Same detection steps
> 2) However, because the OOXML parser no longer handles {{x-tika-ooxml}}, the 
> file is handled by the Package Parser, which overwrites the magic-determined 
> mime type, and the new mime type is {{application/zip}}.
> 3) Some content is extracted because the Package parser handles the zip 
> entries in order and only throws the exception once it hits the last entry in 
> the zip file.
> Ideally, I'd like to keep the magic-determined mime type.  Once we can 
> chain parsers, the user should be able to back off to the PackageParser, but I 
> don't think this should be the default behavior.
> One solution would be to create a new mime type that is not the parent of the 
> other ooxml subtypes, but is itself a leaf subtype, something like: 
> {{x-tika-ooxml-unk}}.
> Any objections/other recommendations?
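
The overwrite described in the quoted issue can be illustrated with a small,
self-contained sketch. It is not Tika code and not necessarily what the commit
above does: the class name, the `PARENT` map, and both helper methods are
hypothetical, and the parent link from `application/x-tika-ooxml` to
`application/zip` is an assumption made for the example. The idea shown is that
a package-level parser keeps an already-detected type when that type is a
specialization of the generic container type, and only falls back to the
container type otherwise.

```java
import java.util.HashMap;
import java.util.Map;

public class MimePreservationSketch {
    // Hypothetical parent links between mime types (assumption for this sketch;
    // a real system would consult its own media type registry).
    static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("application/x-tika-ooxml", "application/zip");
    }

    // Returns true if 'type' has 'ancestor' somewhere above it in PARENT.
    static boolean isSpecializationOf(String type, String ancestor) {
        String current = PARENT.get(type);
        while (current != null) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = PARENT.get(current);
        }
        return false;
    }

    // Keeps the previously detected type when it is more specific than the
    // container type the package-level parser would otherwise report.
    static String resolveContentType(String detected, String packageType) {
        if (detected != null && isSpecializationOf(detected, packageType)) {
            return detected;
        }
        return packageType;
    }

    public static void main(String[] args) {
        // Magic-determined type survives the package-level parse.
        System.out.println(resolveContentType("application/x-tika-ooxml", "application/zip"));
        // Nothing more specific was detected, so the container type is used.
        System.out.println(resolveContentType(null, "application/zip"));
    }
}
```

With a rule like this, a truncated ooxml file would still be reported under its
magic-determined type even when a generic package parser ends up handling it.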



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
