[jira] [Updated] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)

2013-06-27 Thread Daniel Bonniot de Ruisselet (JIRA)

 [ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Bonniot de Ruisselet updated TIKA-1109:
--

Summary: Metadata not extracted before the content in OOXML (pptx)  (was: 
Metadata not extracted before the context in OOXML (pptx))

 Metadata not extracted before the content in OOXML (pptx)
 -

 Key: TIKA-1109
 URL: https://issues.apache.org/jira/browse/TIKA-1109
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Daniel Bonniot de Ruisselet
Priority: Critical
 Fix For: 1.5


 It seems that when processing OOXML documents, the metadata is only read 
 after the text. This means it's impossible to use the metadata while processing 
 the text. I think it would be more useful to have the metadata populated 
 first.
 As a symptom:
 java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
 outputs only the following metadata:
 <meta name="Content-Length" content="36518"/>
 <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
 <meta name="resourceName" content="testPPT.pptx"/>
 while there is more metadata in the file (e.g. <dc:title>Attachment Test</dc:title>).
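
 To make the ordering problem concrete, here is a minimal toy sketch. It is
 not Tika code: MetadataOrderDemo, TitleAwareHandler, parse() and the
 metadataFirst flag are hypothetical names, and a plain Map stands in for
 the metadata object. Only the org.xml.sax types are real JDK classes. It
 shows that a handler which looks up the title when content starts sees
 nothing unless the metadata is populated first.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Toy model of the ordering problem; none of these names come from Tika. */
class MetadataOrderDemo {

    /** Handler that tries to use the title as soon as content starts. */
    static class TitleAwareHandler extends DefaultHandler {
        private final Map<String, String> metadata;
        String titleAtStart; // what the handler could see when content began

        TitleAwareHandler(Map<String, String> metadata) {
            this.metadata = metadata;
        }

        @Override
        public void startDocument() {
            titleAtStart = metadata.get("dc:title");
        }
    }

    /** Hypothetical parser: fills the map either before or after the content events. */
    static void parse(ContentHandler handler, Map<String, String> metadata,
                      boolean metadataFirst) throws SAXException {
        if (metadataFirst) {
            metadata.put("dc:title", "Attachment Test");
        }
        handler.startDocument();
        char[] text = "slide text".toCharArray();
        handler.characters(text, 0, text.length);
        handler.endDocument();
        if (!metadataFirst) {
            metadata.put("dc:title", "Attachment Test"); // too late for the handler
        }
    }

    /** Returns the title the handler saw at startDocument (null if none yet). */
    static String titleSeenAtStart(boolean metadataFirst) {
        Map<String, String> metadata = new LinkedHashMap<>();
        TitleAwareHandler handler = new TitleAwareHandler(metadata);
        try {
            parse(handler, metadata, metadataFirst);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.titleAtStart;
    }

    public static void main(String[] args) {
        System.out.println("metadata after content:  " + titleSeenAtStart(false));
        System.out.println("metadata before content: " + titleSeenAtStart(true));
    }
}
```

 With metadataFirst set to false the handler sees null for the title; with
 true it sees "Attachment Test". The second ordering is the behaviour
 requested in this issue.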

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)

2013-06-27 Thread Daniel Bonniot de Ruisselet (JIRA)

 [ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Bonniot de Ruisselet updated TIKA-1109:
--

Attachment: TIKA-1109.patch

 Metadata not extracted before the content in OOXML (pptx)
 -

 Key: TIKA-1109
 URL: https://issues.apache.org/jira/browse/TIKA-1109
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Daniel Bonniot de Ruisselet
Priority: Critical
 Fix For: 1.5

 Attachments: TIKA-1109.patch


 It seems that when processing OOXML documents, the metadata is only read 
 after the text. This means it's impossible to use the metadata while processing 
 the text. I think it would be more useful to have the metadata populated 
 first.
 As a symptom:
 java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
 outputs only the following metadata:
 <meta name="Content-Length" content="36518"/>
 <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
 <meta name="resourceName" content="testPPT.pptx"/>
 while there is more metadata in the file (e.g. <dc:title>Attachment Test</dc:title>).
