[ 
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694521#comment-13694521
 ] 

Daniel Bonniot de Ruisselet commented on TIKA-1109:
---------------------------------------------------

Nick, thanks a lot for your explanation. If I understand correctly, you are 
saying that, in general, it cannot be guaranteed that the metadata is 
available during parsing, because whether that is possible depends on the 
format. That makes complete sense.

Here I am asking specifically about the OOXML formats, with an example pptx 
file. As I understand it, the OOXML formats are zip files containing XML 
files. In test-classes/test-documents/testPPT.pptx, the metadata seems to be 
inside docProps/core.xml. Would it be possible to read the metadata from 
there first, before starting to parse the content?
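
To illustrate the idea, here is a minimal sketch that opens the file as a 
zip, looks up docProps/core.xml directly, and collects its child elements 
before any slide content is touched. It uses only java.util.zip and the JDK 
XML parser, not Tika's or POI's own classes, so it is not how the OOXML 
parser is actually implemented. The class name CorePropsReader and the method 
readCoreProperties are hypothetical, and the sketch assumes the file is 
available on disk for random access rather than as a stream.

```java
import java.io.File;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: reads the core properties part
// of an OOXML package without looking at any other entry.
public class CorePropsReader {
    static final String CORE_PROPS_ENTRY = "docProps/core.xml";

    public static Map<String, String> readCoreProperties(File ooxml) throws Exception {
        Map<String, String> props = new LinkedHashMap<>();
        // ZipFile gives random access, so the entry order in the archive
        // does not matter here.
        try (ZipFile zip = new ZipFile(ooxml)) {
            ZipEntry entry = zip.getEntry(CORE_PROPS_ENTRY);
            if (entry == null) {
                return props; // no core properties part in this package
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations, since the file may be untrusted.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            try (InputStream in = zip.getInputStream(entry)) {
                Document doc = factory.newDocumentBuilder().parse(in);
                NodeList children = doc.getDocumentElement().getChildNodes();
                for (int i = 0; i < children.getLength(); i++) {
                    Node child = children.item(i);
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        // Keys are qualified names as written, e.g. "dc:title".
                        props.put(child.getNodeName(), child.getTextContent().trim());
                    }
                }
            }
        }
        return props;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java CorePropsReader test-classes/test-documents/testPPT.pptx
        readCoreProperties(new File(args[0]))
                .forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

With something along these lines, the metadata could be populated before the 
text is processed. For input that is only available as a stream, the entry 
order inside the zip would matter, so this would not carry over directly.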

                
> Metadata not extracted before the context in OOXML (pptx)
> ---------------------------------------------------------
>
>                 Key: TIKA-1109
>                 URL: https://issues.apache.org/jira/browse/TIKA-1109
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Daniel Bonniot de Ruisselet
>            Priority: Critical
>             Fix For: 1.5
>
>
> It seems that when processing OOXML documents, the metadata is only read 
> after the text. This means it's impossible to use the metadata while 
> processing the text. I think it would be more useful to have the metadata 
> populated first.
> As a symptom:
> java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
> outputs only as metadata:
> <meta name="Content-Length" content="36518"/>
> <meta name="Content-Type" 
> content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
> <meta name="resourceName" content="testPPT.pptx"/>
> while there is more metadata in the file (e.g. <dc:title>Attachment 
> Test</dc:title>).
