[jira] [Updated] (TIKA-1212) Recursive Extraction of Archive File
[ https://issues.apache.org/jira/browse/TIKA-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1212:
------------------------------
    Attachment: test_recursive_embedded.docx

[~gagravarr], I'm not sure that the example code works with the attached. The issue is that the code keeps appending to whether or not there is new depth. The structure in the attached is:

{noformat}
test_recursive.docx
    embed1.zip
        embed1a.txt
        embed1b.txt
        embed2.zip
            embed2a.txt
            embed2b.txt
            embed3.zip
                embed3.txt
                embed4.zip
                    embed4.txt
{noformat}

{noformat}
Resource is test_recursive.docx/embedded-1/image1.emf
embeddedRelationshipId=rId7
Content-Type=application/x-emf
resourceName=image1.emf

Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt
Content-Encoding=ISO-8859-1
embeddedRelationshipId=embed1a.txt
Content-Type=text/plain; charset=ISO-8859-1
resourceName=embed1a.txt
embed_1a

Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt
Content-Encoding=ISO-8859-1
embeddedRelationshipId=embed1b.txt
Content-Type=text/plain; charset=ISO-8859-1
resourceName=embed1b.txt
embed_1b

Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt
Content-Encoding=ISO-8859-1
embeddedRelationshipId=embed2a.txt
Content-Type=text/plain; charset=ISO-8859-1
resourceName=embed2a.txt
embed_2a

Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt/embed2b.txt
Content-Encoding=ISO-8859-1
embeddedRelationshipId=embed2b.txt
Content-Type=text/plain; charset=ISO-8859-1
resourceName=embed2b.txt
embed_2b

Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt/embed2b.txt/embed3.zip/embed3.txt
Content-Encoding=ISO-8859-1
embeddedRelationshipId=embed3.txt
Content-Type=text/plain; charset=ISO-8859-1
resourceName=embed3.txt
embed_3

Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt/embed2b.txt/embed3.zip/embed3.txt/embed4.zip/embed4.txt
Content-Encoding=ISO-8859-1
embeddedRelationshipId=embed4.txt
Content-Type=text/plain; charset=ISO-8859-1
resourceName=embed4.txt
embed_4

Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt/embed2b.txt/embed3.zip/embed3.txt/embed4.zip
embeddedRelationshipId=embed4.zip
Content-Type=application/zip
resourceName=embed4.zip
embed4.txt

Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt/embed2b.txt/embed3.zip
embeddedRelationshipId=embed3.zip
Content-Type=application/zip
resourceName=embed3.zip
embed3.txt
embed4.zip

Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip
embeddedRelationshipId=embed2.zip
Content-Type=application/zip
resourceName=embed2.zip
embed2a.txt
embed2b.txt
embed3.zip

Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip
embeddedRelationshipId=rId8
Content-Type=application/zip
resourceName=embed1.zip
embed1a.txt
embed1b.txt
embed2.zip

Resource is test_recursive.docx/embedded-1
cp:revision=1
meta:save-date=2014-06-04T14:19:00Z
Application-Name=Microsoft Office Word
dcterms:created=2014-06-04T14:19:00Z
Application-Version=14.
Character-Count-With-Spaces=30
date=2014-06-04T14:19:00Z
extended-properties:Template=Normal.dotm
meta:line-count=1
publisher=
Word-Count=4
meta:paragraph-count=1
Creation-Date=2014-06-04T14:19:00Z
extended-properties:AppVersion=14.
Line-Count=1
extended-properties:Application=Microsoft Office Word
Paragraph-Count=1
Last-Save-Date=2014-06-04T14:19:00Z
Revision-Number=1
dcterms:modified=2014-06-04T14:19:00Z
meta:creation-date=2014-06-04T14:19:00Z
Template=Normal.dotm
Page-Count=1
meta:character-count=27
Last-Modified=2014-06-04T14:19:00Z
extended-properties:Company=
meta:word-count=4
modified=2014-06-04T14:19:00Z
xmpTPg:NPages=1
dc:publisher=
Character Count=27
meta:page-count=1
meta:character-count-with-spaces=30
Content-Type=application/vnd.openxmlformats-officedocument.wordprocessingml.document
embed_0
{noformat}

Recursive Extraction of Archive File
    Key: TIKA-1212
    URL: https://issues.apache.org/jira/browse/TIKA-1212
    Project: Tika
    Issue Type: Bug
    Reporter: Vikram
    Priority: Critical
    Attachments: RecursiveMetadataParserZukka.java, TIKA-Output.xlsx, abc.zip, abc.zip,
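
The path growth in the output above (embed1a.txt/embed1b.txt/embed2.zip/...) is what one would expect if a single shared path is appended to for every embedded document and never trimmed. The following is only a minimal, Tika-free sketch of that effect and of one possible alternative, passing the parent's path down each recursive call. The class `EmbeddedPathSketch`, the `Node` type and both method names are hypothetical and are not taken from the example code or from Tika.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmbeddedPathSketch {

    /** Hypothetical stand-in for a container or leaf document. */
    static final class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /** Variant that mimics the reported output: one shared buffer only ever grows. */
    static List<String> appendAlways(Node root) {
        List<String> out = new ArrayList<>();
        appendAlways(root, new StringBuilder(root.name), out);
        return out;
    }

    private static void appendAlways(Node node, StringBuilder shared, List<String> out) {
        for (Node child : node.children) {
            shared.append('/').append(child.name); // appended for siblings too
            out.add(shared.toString());
            appendAlways(child, shared, out);
        }
    }

    /** Depth-aware variant: each call receives its parent's path as a value. */
    static List<String> parentScoped(Node root) {
        List<String> out = new ArrayList<>();
        parentScoped(root, root.name, out);
        return out;
    }

    private static void parentScoped(Node node, String parentPath, List<String> out) {
        for (Node child : node.children) {
            String path = parentPath + "/" + child.name; // siblings share the same prefix
            out.add(path);
            parentScoped(child, path, out);
        }
    }

    public static void main(String[] args) {
        // Reduced version of the structure in test_recursive_embedded.docx.
        Node root = new Node("test_recursive.docx",
                new Node("embed1.zip",
                        new Node("embed1a.txt"),
                        new Node("embed1b.txt"),
                        new Node("embed2.zip", new Node("embed2a.txt"))));
        appendAlways(root).forEach(System.out::println);
        System.out.println();
        parentScoped(root).forEach(System.out::println);
    }
}
```

With the reduced tree in `main`, the first variant ends at test_recursive.docx/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt, while the second ends at test_recursive.docx/embed1.zip/embed2.zip/embed2a.txt.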
[jira] [Updated] (TIKA-1212) Recursive Extraction of Archive File
[ https://issues.apache.org/jira/browse/TIKA-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1212:
------------------------------
    Attachment: RECURSIVE_PARSER_WRAPPER_HACK.patch

This is nowhere near ready to go, but this shows the solution I mentioned. There has to be an easier way...

SIDENOTE: The overall design is that the parser wrapper returns a {code}List<Metadata>{code} and the content for each embedded file is stored within a Metadata. This goes against several design principles within Tika, but I'd like to add something like this for TIKA-1302, and there may be interest for clients of server and cli.

If this looks to be useful, I'll clean it up, add test cases and commit.

Recursive Extraction of Archive File
    Key: TIKA-1212
    URL: https://issues.apache.org/jira/browse/TIKA-1212
    Project: Tika
    Issue Type: Bug
    Reporter: Vikram
    Priority: Critical
    Attachments: RECURSIVE_PARSER_WRAPPER_HACK.patch, RecursiveMetadataParserZukka.java, RecursiveParsingExample.java, TIKA-Output.xlsx, abc.zip, abc.zip, test_recursive_embedded.docx

Please refer to the code: http://wiki.apache.org/tika/RecursiveMetadata#Main_from_Jukka.27s_Example

Requirement:
- abc.zip
--- a.doc
--- b.xls
--- pqr.zip
------ m.ppt

There are two issues with TIKA:
1. How to optionally block separate extraction of embedded docs?
2. When I extract recursively, the file name / resourceKeyName is not coming out properly. For example:
-- a.doc should have the value abc.zip/a.doc. Similarly for b.xls. This is fine, BUT m.ppt has the resource file name pqr/m.ppt, which is WRONG. It should have the value abc.zip/pqr.zip/m.ppt.
-- Even for the embedded doc, only a random name is coming, not even with the proper file path.

-- This message was sent by Atlassian JIRA (v6.2#6252)
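
The design in the sidenote, one metadata object per document returned as a flat list with the extracted content stored inside each entry, might be pictured roughly as follows. This is a plain-Java sketch and not the patch itself: `FlatMetadataListSketch`, `Doc`, and the two `sketch:` keys are hypothetical names, and a `Map<String, String>` stands in for Tika's Metadata class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FlatMetadataListSketch {

    // Hypothetical keys; the attached patch may use different names.
    static final String PATH_KEY = "sketch:embedded-path";
    static final String CONTENT_KEY = "sketch:content";

    /** Hypothetical stand-in for a parsed document with optional embedded children. */
    static final class Doc {
        final String name;
        final String text;
        final List<Doc> embedded = new ArrayList<>();

        Doc(String name, String text) {
            this.name = name;
            this.text = text;
        }

        Doc embed(Doc child) {
            embedded.add(child);
            return this;
        }
    }

    /** Returns one metadata map per document: the container first, then its embedded documents. */
    static List<Map<String, String>> flatten(Doc root) {
        List<Map<String, String>> out = new ArrayList<>();
        flatten(root, root.name, out);
        return out;
    }

    private static void flatten(Doc doc, String path, List<Map<String, String>> out) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("resourceName", doc.name);
        metadata.put(PATH_KEY, path);
        metadata.put(CONTENT_KEY, doc.text); // the content travels inside the metadata entry
        out.add(metadata);
        for (Doc child : doc.embedded) {
            flatten(child, path + "/" + child.name, out);
        }
    }

    public static void main(String[] args) {
        Doc root = new Doc("abc.zip", "")
                .embed(new Doc("a.doc", "text of a.doc"))
                .embed(new Doc("pqr.zip", "").embed(new Doc("m.ppt", "text of m.ppt")));
        for (Map<String, String> metadata : flatten(root)) {
            System.out.println(metadata);
        }
    }
}
```

A caller that only needs text can iterate the list and read the content key from each entry, at the cost of holding every document's content in memory at once.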
[jira] [Updated] (TIKA-1212) Recursive Extraction of Archive File
[ https://issues.apache.org/jira/browse/TIKA-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vikram updated TIKA-1212:
-------------------------
    Attachment: RecursiveMetadataParserZukka.java

Please find the attached standalone program. You need to change the package. I am using Tika 1.4.
[jira] [Updated] (TIKA-1212) Recursive Extraction of Archive File
[ https://issues.apache.org/jira/browse/TIKA-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1212:
------------------------------
    Attachment: abc.zip

Does this test file meet your description?
[jira] [Updated] (TIKA-1212) Recursive Extraction of Archive File
[ https://issues.apache.org/jira/browse/TIKA-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vikram updated TIKA-1212:
-------------------------
    Attachment: TIKA-Output.xlsx
                abc.zip

I have created one more UPDATED abc.zip file and put the details of the code output in TIKA-Output.xlsx, with the example, the output, and the expected result. Please let me know if you need any further detail. We need this fix or a solution as soon as possible. Thanks a lot.
[jira] [Updated] (TIKA-1212) Recursive Extraction of Archive File
[ https://issues.apache.org/jira/browse/TIKA-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vikram updated TIKA-1212:
-------------------------
    Description:

Please refer to the code: http://wiki.apache.org/tika/RecursiveMetadata#Main_from_Jukka.27s_Example

Requirement:
- abc.zip
--- a.doc
--- b.xls
--- pqr.zip
------ m.ppt

There are two issues with TIKA:
1. How to optionally block separate extraction of embedded docs?
2. When I extract recursively, the file name / resourceKeyName is not coming out properly. For example:
-- a.doc should have the value abc.zip/a.doc. Similarly for b.xls. This is fine, BUT m.ppt has the resource file name pqr/m.ppt, which is WRONG. It should have the value abc.zip/pqr.zip/m.ppt.
-- Even for the embedded doc, only a random name is coming, not even with the proper file path.

    was:

Please refer to the code: http://wiki.apache.org/tika/RecursiveMetadata#Main_from_Jukka.27s_Example

Requirement:
- abc.zip
--- a.doc
b.xls
- pqr.zip
--- m.ppt

There are two issues with TIKA:
1. How to optionally block separate extraction of embedded docs?
2. When I extract recursively, the file name / resourceKeyName is not coming out properly. For example:
-- a.doc should have the value abc.zip/a.doc. Similarly for b.xls. This is fine, BUT m.ppt has the resource file name pqr/m.ppt, which is WRONG. It should have the value abc.zip/pqr.zip/m.ppt.
-- Even for the embedded doc, only a random name is coming, not even with the proper file path.
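
To make the two requests in the description concrete, here is a small plain-Java sketch of a recursive walk that reports full container paths (abc.zip/pqr.zip/m.ppt) and lets the caller decide, per container, whether its embedded documents are visited at all. It is an illustration under assumptions, not Tika code: `OptionalEmbeddedSketch`, `Node`, `walk` and the `descendInto` predicate are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class OptionalEmbeddedSketch {

    /** Hypothetical stand-in for a container or leaf document. */
    static final class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /** Lists full paths; descendInto decides whether a node's embedded documents are visited. */
    static List<String> walk(Node root, Predicate<Node> descendInto) {
        List<String> out = new ArrayList<>();
        walk(root, root.name, descendInto, out);
        return out;
    }

    private static void walk(Node node, String path, Predicate<Node> descendInto, List<String> out) {
        out.add(path);
        if (!descendInto.test(node)) {
            return; // the container itself is reported, its embedded documents are skipped
        }
        for (Node child : node.children) {
            walk(child, path + "/" + child.name, descendInto, out);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("abc.zip",
                new Node("a.doc"),
                new Node("b.xls"),
                new Node("pqr.zip", new Node("m.ppt")));
        // Visit everything.
        System.out.println(walk(root, node -> true));
        // Block extraction of anything embedded in pqr.zip.
        System.out.println(walk(root, node -> !node.name.equals("pqr.zip")));
    }
}
```

Keeping the decision in a predicate leaves the walk itself unchanged whether embedded documents are extracted or not.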