[jira] [Commented] (TIKA-1109) Metadata not extracted before the context in OOXML (pptx)

Nick Burch (JIRA) Thu, 27 Jun 2013 03:55:38 -0700

    [ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694612#comment-13694612 ]


Nick Burch commented on TIKA-1109:
----------------------------------

For OOXML files, most of the metadata lives in a few separate XML files within 
the zip. For Excel, a few bits are also stored in the main spreadsheet / sheet 
stream.
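
To illustrate the point about metadata living in separate XML parts, here is a 
minimal sketch using only the JDK. It is not Tika's actual code. It assumes the 
core properties sit at the conventional part name docProps/core.xml (strictly, 
the location is given by the package relationships), and the class name 
OoxmlCoreTitle and method readTitle are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Tika.
class OoxmlCoreTitle {
    // Dublin Core namespace used by <dc:title> in the core properties part.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    // Assumption: conventional location of the core properties part.
    static final String CORE_PART = "docProps/core.xml";

    /** Returns the dc:title found in the core properties part, or null if absent. */
    static String readTitle(InputStream ooxml) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!CORE_PART.equals(entry.getName())) {
                    continue;
                }
                // Copy the entry into memory so the XML parser never touches
                // (or closes) the underlying zip stream.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(buf.toByteArray()));
                NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
                return titles.getLength() > 0 ? titles.item(0).getTextContent() : null;
            }
        }
        return null; // no core properties part in this package
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println("dc:title = " + readTitle(in));
        }
    }
}
```

Running it as "java OoxmlCoreTitle testPPT.pptx" prints whatever title the file 
carries. Since this part can be read without touching the slide content, 
fetching it before the text looks feasible in principle; whether that holds for 
the Excel bits mentioned above is a separate question.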

Not sure if it would break things if we did most of the metadata fetching 
first. Could you try moving the metadata line up, and see if the unit tests all 
still pass?
                
> Metadata not extracted before the context in OOXML (pptx)
> ---------------------------------------------------------
>
>                 Key: TIKA-1109
>                 URL: https://issues.apache.org/jira/browse/TIKA-1109
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Daniel Bonniot de Ruisselet
>            Priority: Critical
>             Fix For: 1.5
>
>
> It seems that when processing OOXML documents, the metadata is only read 
> after the text. This means it's impossible to use the metadata while 
> processing the text. I think it would be more useful to have the metadata 
> populated first.
> As a symptom:
> java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
> outputs only the following metadata:
> <meta name="Content-Length" content="36518"/>
> <meta name="Content-Type" 
> content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
> <meta name="resourceName" content="testPPT.pptx"/>
> while there is more metadata in the file (e.g. <dc:title>Attachment 
> Test</dc:title>).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
