[jira] [Updated] (TIKA-1212) Recursive Extraction of Archive File

2014-06-04 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1212:
--

Attachment: test_recursive_embedded.docx

[~gagravarr], I'm not sure that the example code works with the attached file.  The 
issue is that the code keeps appending to the path whether or not there is new depth.

The structure in the attached is:
{noformat}
test_recursive.docx
   embed1.zip
      embed1a.txt
      embed1b.txt
      embed2.zip
         embed2a.txt
         embed2b.txt
         embed3.zip
            embed3.txt
            embed4.zip
               embed4.txt
{noformat}

{noformat}

Resource is test_recursive.docx/embedded-1/image1.emf

embeddedRelationshipId=rId7 Content-Type=application/x-emf 
resourceName=image1.emf 



Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt

Content-Encoding=ISO-8859-1 embeddedRelationshipId=embed1a.txt 
Content-Type=text/plain; charset=ISO-8859-1 resourceName=embed1a.txt 

embed_1a


Resource is 
test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt

Content-Encoding=ISO-8859-1 embeddedRelationshipId=embed1b.txt 
Content-Type=text/plain; charset=ISO-8859-1 resourceName=embed1b.txt 

embed_1b


Resource is 
test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt

Content-Encoding=ISO-8859-1 embeddedRelationshipId=embed2a.txt 
Content-Type=text/plain; charset=ISO-8859-1 resourceName=embed2a.txt 

embed_2a


Resource is 
test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt/embed2b.txt

Content-Encoding=ISO-8859-1 embeddedRelationshipId=embed2b.txt 
Content-Type=text/plain; charset=ISO-8859-1 resourceName=embed2b.txt 

embed_2b


Resource is 
test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt/embed2b.txt/embed3.zip/embed3.txt

Content-Encoding=ISO-8859-1 embeddedRelationshipId=embed3.txt 
Content-Type=text/plain; charset=ISO-8859-1 resourceName=embed3.txt 

embed_3


Resource is 
test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt/embed2b.txt/embed3.zip/embed3.txt/embed4.zip/embed4.txt

Content-Encoding=ISO-8859-1 embeddedRelationshipId=embed4.txt 
Content-Type=text/plain; charset=ISO-8859-1 resourceName=embed4.txt 

embed_4


Resource is 
test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt/embed2b.txt/embed3.zip/embed3.txt/embed4.zip

embeddedRelationshipId=embed4.zip Content-Type=application/zip 
resourceName=embed4.zip 


embed4.txt



Resource is 
test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt/embed2b.txt/embed3.zip

embeddedRelationshipId=embed3.zip Content-Type=application/zip 
resourceName=embed3.zip 


embed3.txt


embed4.zip



Resource is 
test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip

embeddedRelationshipId=embed2.zip Content-Type=application/zip 
resourceName=embed2.zip 


embed2a.txt


embed2b.txt


embed3.zip



Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip

embeddedRelationshipId=rId8 Content-Type=application/zip 
resourceName=embed1.zip 


embed1a.txt


embed1b.txt


embed2.zip



Resource is test_recursive.docx/embedded-1

cp:revision=1 meta:save-date=2014-06-04T14:19:00Z Application-Name=Microsoft 
Office Word dcterms:created=2014-06-04T14:19:00Z Application-Version=14. 
Character-Count-With-Spaces=30 date=2014-06-04T14:19:00Z 
extended-properties:Template=Normal.dotm meta:line-count=1 publisher= 
Word-Count=4 meta:paragraph-count=1 Creation-Date=2014-06-04T14:19:00Z 
extended-properties:AppVersion=14. Line-Count=1 
extended-properties:Application=Microsoft Office Word Paragraph-Count=1 
Last-Save-Date=2014-06-04T14:19:00Z Revision-Number=1 
dcterms:modified=2014-06-04T14:19:00Z meta:creation-date=2014-06-04T14:19:00Z 
Template=Normal.dotm Page-Count=1 meta:character-count=27 
Last-Modified=2014-06-04T14:19:00Z extended-properties:Company= 
meta:word-count=4 modified=2014-06-04T14:19:00Z xmpTPg:NPages=1 dc:publisher= 
Character Count=27 meta:page-count=1 meta:character-count-with-spaces=30 
Content-Type=application/vnd.openxmlformats-officedocument.wordprocessingml.document
 




embed_0  

{noformat}
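
The paths in the output above keep growing because each new name is appended to the previous path, even when the entry is a sibling rather than a child. One way to avoid that, sketched here in plain Java with no Tika dependency, is to push a path segment when entering an embedded document and pop it again when leaving. The `Node` class and the `collectPaths` method are hypothetical names used for illustration; they are not part of the example code under discussion or of Tika.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class PathTrackingSketch {

    // Hypothetical stand-in for a container or leaf document; not a Tika class.
    static final class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // Walks the tree depth-first and records the full path of every node.
    static List<String> collectPaths(Node root) {
        List<String> paths = new ArrayList<>();
        walk(root, new ArrayDeque<>(), paths);
        return paths;
    }

    private static void walk(Node node, Deque<String> segments, List<String> paths) {
        segments.addLast(node.name);            // entering: one level deeper
        paths.add(String.join("/", segments));  // path reflects the current depth only
        for (Node child : node.children) {
            walk(child, segments, paths);
        }
        segments.removeLast();                  // leaving: siblings must not inherit this segment
    }

    public static void main(String[] args) {
        Node root = new Node("test_recursive.docx",
            new Node("embed1.zip",
                new Node("embed1a.txt"),
                new Node("embed1b.txt"),
                new Node("embed2.zip",
                    new Node("embed2a.txt"))));
        collectPaths(root).forEach(System.out::println);
    }
}
```

With this approach embed1b.txt is reported as test_recursive.docx/embed1.zip/embed1b.txt rather than being nested under embed1a.txt.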

 Recursive Extraction of Archive File
 

 Key: TIKA-1212
 URL: https://issues.apache.org/jira/browse/TIKA-1212
 Project: Tika
  Issue Type: Bug
Reporter: Vikram
Priority: Critical
 Attachments: RecursiveMetadataParserZukka.java, TIKA-Output.xlsx, 
 abc.zip, abc.zip, 

[jira] [Updated] (TIKA-1212) Recursive Extraction of Archive File

2014-06-04 Thread Tim Allison (JIRA)


Tim Allison updated TIKA-1212:
--

Attachment: RECURSIVE_PARSER_WRAPPER_HACK.patch

This is nowhere near ready to go, but this shows the solution I mentioned.

There has to be an easier way...

SIDENOTE:
The overall design is that the parser wrapper returns a 
{code}List<Metadata>{code} and the content for each embedded file is stored 
within a Metadata object.  This goes against several design principles within Tika, 
but I'd like to add something like this for TIKA-1302, and there may be 
interest from clients of the server and CLI.  If this looks useful, I'll clean 
it up, add test cases, and commit.
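
To make the design in the side note more concrete, here is a rough sketch of the flattened result shape: one metadata record per document, container first, with the extracted text stored as just another metadata value. It uses `Map<String, String>` as a stand-in for Tika's `Metadata` so that it runs without Tika on the classpath. The `Doc` class, the `flatten` method and the `sketch:content` key are assumptions made for illustration and are not taken from the attached patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MetadataListSketch {

    // Hypothetical key for the extracted text; not a Tika constant.
    static final String CONTENT_KEY = "sketch:content";
    // Key name as seen in the parser output posted on this issue.
    static final String NAME_KEY = "resourceName";

    // Hypothetical stand-in for an already parsed document and its embedded files.
    static final class Doc {
        final String name;
        final String text;
        final List<Doc> embedded;

        Doc(String name, String text, Doc... embedded) {
            this.name = name;
            this.text = text;
            this.embedded = Arrays.asList(embedded);
        }
    }

    // Returns one map per document: the container first, then each embedded file depth-first.
    static List<Map<String, String>> flatten(Doc root) {
        List<Map<String, String>> result = new ArrayList<>();
        collect(root, result);
        return result;
    }

    private static void collect(Doc doc, List<Map<String, String>> result) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(NAME_KEY, doc.name);
        metadata.put(CONTENT_KEY, doc.text);  // content lives inside the metadata record
        result.add(metadata);
        for (Doc child : doc.embedded) {
            collect(child, result);
        }
    }

    public static void main(String[] args) {
        Doc root = new Doc("test_recursive.docx", "embed_0",
            new Doc("embed1.zip", "",
                new Doc("embed1a.txt", "embed_1a")));
        for (Map<String, String> metadata : flatten(root)) {
            System.out.println(metadata);
        }
    }
}
```

A caller then only has to iterate over one list instead of handling nested callbacks, which is the convenience the side note is after; the cost is that text content is held in memory alongside the metadata.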

 Attachments: RECURSIVE_PARSER_WRAPPER_HACK.patch, 
 RecursiveMetadataParserZukka.java, RecursiveParsingExample.java, 
 TIKA-Output.xlsx, abc.zip, abc.zip, test_recursive_embedded.docx


 Please refer to the code: 
 http://wiki.apache.org/tika/RecursiveMetadata#Main_from_Jukka.27s_Example
 Requirement:
 ------------
 abc.zip
    --- a.doc
    --- b.xls
    --- pqr.zip
           --- m.ppt
 There are two issues with Tika:
 1. How can separate extraction of embedded docs be optionally blocked?
 2. When I extract recursively, the file name (resourceKeyName) is not 
 reported properly. For example:
 -- a.doc should have the value abc.zip/a.doc, and similarly for b.xls. This is 
 fine, BUT m.ppt has the resource file name pqr/m.ppt, which is WRONG. It 
 should have the value abc.zip/pqr.zip/m.ppt.
 -- Even for an embedded doc, only a random name is reported, without the 
 proper file path.
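
For the first question, the general idea of optionally blocking embedded documents can be sketched as a filter that is consulted before descending into each entry. The plain-Java example below does not use the Tika API; `Entry`, `extract` and the predicate are hypothetical names, and the tree mirrors the abc.zip layout from the requirement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class EmbeddedFilterSketch {

    // Hypothetical stand-in for an archive entry or embedded document.
    static final class Entry {
        final String name;
        final List<Entry> children;

        Entry(String name, Entry... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // Lists the full path of the root and of every embedded entry accepted by the filter.
    // A rejected entry is skipped together with everything nested inside it.
    static List<String> extract(Entry root, Predicate<String> accept) {
        List<String> paths = new ArrayList<>();
        visit(root, root.name, accept, paths);
        return paths;
    }

    private static void visit(Entry entry, String path,
                              Predicate<String> accept, List<String> paths) {
        paths.add(path);
        for (Entry child : entry.children) {
            if (accept.test(child.name)) {
                // The child's path is always built from its parent's path.
                visit(child, path + "/" + child.name, accept, paths);
            }
        }
    }

    public static void main(String[] args) {
        Entry abc = new Entry("abc.zip",
            new Entry("a.doc"),
            new Entry("b.xls"),
            new Entry("pqr.zip", new Entry("m.ppt")));
        System.out.println(extract(abc, name -> true));
        System.out.println(extract(abc, name -> !name.endsWith(".zip")));
    }
}
```

With the accept-everything filter, m.ppt is listed as abc.zip/pqr.zip/m.ppt, which is the value the report expects. With the second filter, pqr.zip and its contents are left out entirely.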



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (TIKA-1212) Recursive Extraction of Archive File

2013-12-20 Thread Vikram (JIRA)


Vikram updated TIKA-1212:
-

Attachment: RecursiveMetadataParserZukka.java

Please find the attached standalone program. You will need to change the package 
declaration. I am using Tika 1.4.

 Attachments: RecursiveMetadataParserZukka.java, TIKA-Output.xlsx, 
 abc.zip, abc.zip




[jira] [Updated] (TIKA-1212) Recursive Extraction of Archive File

2013-12-19 Thread Tim Allison (JIRA)


Tim Allison updated TIKA-1212:
--

Attachment: abc.zip

Does this test file meet your description?

 Attachments: abc.zip




[jira] [Updated] (TIKA-1212) Recursive Extraction of Archive File

2013-12-19 Thread Vikram (JIRA)


Vikram updated TIKA-1212:
-

Attachment: TIKA-Output.xlsx
abc.zip

I have created one more updated abc.zip file and have also put the details of the 
code output in TIKA-Output.xlsx, together with the example, the output, and 
the expected result. Please let me know if you need any further detail. We 
need this fix or solution as soon as possible. Thanks a lot.

 Attachments: TIKA-Output.xlsx, abc.zip, abc.zip




[jira] [Updated] (TIKA-1212) Recursive Extraction of Archive File

2013-12-18 Thread Vikram (JIRA)


Vikram updated TIKA-1212:
-

Description: 
Please refer to the code: 
http://wiki.apache.org/tika/RecursiveMetadata#Main_from_Jukka.27s_Example
Requirement:
------------
abc.zip
   --- a.doc
   --- b.xls
   --- pqr.zip
          --- m.ppt
There are two issues with Tika:
1. How can separate extraction of embedded docs be optionally blocked?
2. When I extract recursively, the file name (resourceKeyName) is not 
reported properly. For example:
-- a.doc should have the value abc.zip/a.doc, and similarly for b.xls. This is 
fine, BUT m.ppt has the resource file name pqr/m.ppt, which is WRONG. It 
should have the value abc.zip/pqr.zip/m.ppt.
-- Even for an embedded doc, only a random name is reported, without the 
proper file path.



  was:
Please refer to the code: 
http://wiki.apache.org/tika/RecursiveMetadata#Main_from_Jukka.27s_Example
Requirement:
-
abc.zip
   --- a.doc
    b.xls
  - pqr.zip
   --- m.ppt
There are two issues with Tika:
1. How can separate extraction of embedded docs be optionally blocked?
2. When I extract recursively, the file name (resourceKeyName) is not 
reported properly. For example:
-- a.doc should have the value abc.zip/a.doc, and similarly for b.xls. This is 
fine, BUT m.ppt has the resource file name pqr/m.ppt, which is WRONG. It 
should have the value abc.zip/pqr.zip/m.ppt.
-- Even for an embedded doc, only a random name is reported, without the 
proper file path.



