[ 
https://issues.apache.org/jira/browse/TIKA-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106136#comment-16106136
 ] 

Matthew Caruana Galizia commented on TIKA-2436:
-----------------------------------------------

To give you an example of why this is a problem in an actual use case, we are 
ingesting the text extracted from files into Solr. The files are stored in the 
index in the same hierarchy that they have on disk: files extracted from 
container files are stored in the index as child documents of the container 
document.

Therefore, for an EMZ file within a DOCX file, we end up with three documents:

DOCX -> EMZ -> EMF

Whereas we expect:

DOCX -> EMZ
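
To illustrate the detection side, here is a minimal sketch of how a gzip 
stream could be recognised as an EMZ by peeking at the decompressed header. 
This is not Tika's code: the class name EmzSniffer and all of its methods are 
hypothetical, and it assumes the standard EMF header layout (record type 1 in 
the first four bytes, the " EMF" signature at byte offset 40).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Tika.
public class EmzSniffer {
    // Byte offset of the signature field in the EMF header record.
    private static final int SIGNATURE_OFFSET = 40;
    // Bytes needed to cover the record type through the signature.
    private static final int HEADER_LENGTH = 44;

    /** True if the data starts with the gzip magic number 0x1F 0x8B. */
    static boolean isGzip(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xFF) == 0x1F
                && (data[1] & 0xFF) == 0x8B;
    }

    /** True if the first len bytes of h look like an EMF header. */
    static boolean isEmfHeader(byte[] h, int len) {
        return len >= HEADER_LENGTH
                // Record type EMR_HEADER (1), little-endian.
                && h[0] == 1 && h[1] == 0 && h[2] == 0 && h[3] == 0
                // Signature bytes " EMF".
                && h[SIGNATURE_OFFSET] == 0x20
                && h[SIGNATURE_OFFSET + 1] == 0x45
                && h[SIGNATURE_OFFSET + 2] == 0x4D
                && h[SIGNATURE_OFFSET + 3] == 0x46;
    }

    /** True if the data is gzip and its decompressed content starts with an EMF header. */
    static boolean isEmz(byte[] data) throws IOException {
        if (!isGzip(data)) {
            return false;
        }
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] header = new byte[HEADER_LENGTH];
            int total = 0;
            // A single read() may return fewer bytes than requested, so loop.
            while (total < HEADER_LENGTH) {
                int n = in.read(header, total, HEADER_LENGTH - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
            return isEmfHeader(header, total);
        }
    }

    /** Demo helper: gzip-compresses a byte array in memory. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        }
        return bos.toByteArray();
    }
}
```

With a check along these lines, the EMZ could be reported as a single 
compressed EMF document rather than as a gzip container holding an EMF child.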

> Support for GZIP-compressed EMF files
> -------------------------------------
>
>                 Key: TIKA-2436
>                 URL: https://issues.apache.org/jira/browse/TIKA-2436
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime, parser
>    Affects Versions: 1.15
>            Reporter: Matthew Caruana Galizia
>         Attachments: image004.emz
>
>
> Tika is currently detecting EMZ (compressed EMF) files as simple gzip files. 
> These files should instead be detected as EMF files and the EMFParser should 
> perform decompression transparently.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
