[ 
https://issues.apache.org/jira/browse/TIKA-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106135#comment-16106135
 ] 

Matthew Caruana Galizia commented on TIKA-2436:
-----------------------------------------------

The difference is that the file is treated as a package or container format, 
when I don't think it should be. It is a distinct file format that happens to 
be compressed.

Instead of treating it as a container and relying on the CompressorParser to 
call the ParsingEmbeddedDocumentExtractor, the EMFParser should instead have 
native support for the compression, unwrapping the compression itself.

The same should be true for SVGZ and WMZ.
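
A minimal sketch of what "unwrapping the compression itself" could look like, 
using only the JDK. The class GzipUnwrap and the method unwrapIfGzipped are 
hypothetical names for illustration and are not part of Tika's API; how this 
would be wired into the EMFParser is left open. The sketch peeks at the first 
two bytes for the gzip magic number (0x1f 0x8b) and, if present, wraps the 
stream in a GZIPInputStream so the caller sees plain EMF bytes either way.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not an existing Tika class.
public class GzipUnwrap {

    /**
     * Returns a stream of decompressed bytes if the input starts with the
     * gzip magic number, otherwise returns the input (buffered) unchanged.
     */
    public static InputStream unwrapIfGzipped(InputStream in) throws IOException {
        // mark/reset is needed to peek without consuming the header
        InputStream buffered = in.markSupported() ? in : new BufferedInputStream(in);
        buffered.mark(2);
        int b1 = buffered.read();
        int b2 = buffered.read();
        buffered.reset();

        if (b1 == 0x1f && b2 == 0x8b) {
            // Looks like gzip: decompress transparently
            return new GZIPInputStream(buffered);
        }
        // Not gzip: hand back the original bytes untouched
        return buffered;
    }
}
```

With something along these lines, a parser could call unwrapIfGzipped on its 
input before reading the EMF header, so that EMZ and plain EMF take the same 
code path and no embedded-document extraction is involved.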

To draw a parallel, DOCX is also a compressed format, but Tika does not treat 
it as a package. It understands that the compression is an artefact of the 
format rather than an explicit container.

> Support for GZIP-compressed EMF files
> -------------------------------------
>
>                 Key: TIKA-2436
>                 URL: https://issues.apache.org/jira/browse/TIKA-2436
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>    Affects Versions: 1.15
>            Reporter: Matthew Caruana Galizia
>         Attachments: image004.emz
>
>
> Tika is currently detecting EMZ (compressed EMF) files as simple gzip files. 
> These files should instead be detected as EMF files and the EMFParser should 
> perform decompression transparently.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
