[jira] [Updated] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)
[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Bonniot de Ruisselet updated TIKA-1109:
---------------------------------------------

    Summary: Metadata not extracted before the content in OOXML (pptx)  (was: Metadata not extracted before the context in OOXML (pptx))

Metadata not extracted before the content in OOXML (pptx)
---------------------------------------------------------

        Key: TIKA-1109
        URL: https://issues.apache.org/jira/browse/TIKA-1109
    Project: Tika
 Issue Type: Bug
 Components: parser
   Reporter: Daniel Bonniot de Ruisselet
   Priority: Critical
    Fix For: 1.5

It seems that when processing OOXML documents, the metadata is only read after the text. This means it's impossible to use the metadata while processing the text. I think it would be more useful to have the metadata populated first.

As a symptom:

    java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx

outputs only as metadata:

    <meta name="Content-Length" content="36518"/>
    <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
    <meta name="resourceName" content="testPPT.pptx"/>

while there is more metadata in the file (e.g. <dc:title>Attachment Test</dc:title>).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)
[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Bonniot de Ruisselet updated TIKA-1109:
---------------------------------------------

    Attachment: TIKA-1109.patch

Metadata not extracted before the content in OOXML (pptx)
---------------------------------------------------------

        Key: TIKA-1109
        URL: https://issues.apache.org/jira/browse/TIKA-1109
    Project: Tika
 Issue Type: Bug
 Components: parser
   Reporter: Daniel Bonniot de Ruisselet
   Priority: Critical
    Fix For: 1.5
Attachments: TIKA-1109.patch
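The ordering the report asks for (metadata available before any text is handled) can be illustrated outside Tika. The sketch below is neither the attached patch nor Tika's parser code. It relies only on the fact that an OOXML file is a ZIP package, and it assumes the core properties part sits at the conventional path docProps/core.xml (strictly, the package relationships determine that location). The class name CorePropertiesFirst and the method readTitle are hypothetical names chosen for this illustration.

```java
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only (not part of Tika).
final class CorePropertiesFirst {
    // Dublin Core namespace used by dc:title in the core properties part.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Assumption: the core properties part lives at this conventional path.
    static final String CORE_PART = "docProps/core.xml";

    /** Returns the dc:title value, or null if the part or the element is missing. */
    static String readTitle(Path ooxmlFile) throws Exception {
        // ZipFile gives random access, so the metadata part can be read
        // first regardless of where it sits inside the archive.
        try (ZipFile zip = new ZipFile(ooxmlFile.toFile())) {
            ZipEntry core = zip.getEntry(CORE_PART);
            if (core == null) {
                return null;
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            try (InputStream in = zip.getInputStream(core)) {
                Document doc = factory.newDocumentBuilder().parse(in);
                NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
                return titles.getLength() == 0 ? null : titles.item(0).getTextContent();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java CorePropertiesFirst <file.pptx>");
            return;
        }
        // Metadata is resolved here, before any slide text would be processed.
        System.out.println("dc:title = " + readTitle(Paths.get(args[0])));
    }
}
```

Reading through ZipFile rather than streaming the archive sequentially is the design choice that makes "metadata first" possible here: a sequential reader would only see the properties part when it reaches it in the archive.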