[ 
https://issues.apache.org/jira/browse/TIKA-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reassigned TIKA-4630:
---------------------------------

    Assignee: Tim Allison

> embeddedRelationshipId is missing from tar files that are children of gzip 
> files (i.e. tarballs)
> ------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4630
>                 URL: https://issues.apache.org/jira/browse/TIKA-4630
>             Project: Tika
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 3.2.3
>            Reporter: Iachimoe
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: tika-embedded-reln-jira.PATCH
>
>
> For some context, my usecase requires me to be able to precisely locate 
> individual files within nested archives, using only metadata from tika.
> I have made this work for zip files using a combination of 
> EMBEDDED_RELATIONSHIP_ID and EMBEDDED_RESOURCE_PATH. EMBEDDED_RESOURCE_PATH 
> on its own does not work because it omits information about which directory a 
> file is stored in within an archive. When trying to make it work for tgz 
> files, the approach breaks because, while the parser quite reasonably 
> considers "archive.tar" to be a child of "archive.tar.gz", the corresponding 
> entry metadata lacks an EMBEDDED_RELATIONSHIP_ID. This makes it difficult to 
> construct a path of the form 
> "archive.tar.gz/archive.tar/my_folder/my_file.txt".
> I attach a rough solution (patch is against branch_3x), with unit test, that 
> appears to work for my use case. The approach is for the 
> RecursiveParserWrapper to set EMBEDDED_RELATIONSHIP_ID once parsing has 
> completed if it has not been set during the parsing process. It sets it to 
> the value of RESOURCE_NAME_KEY , which is probably not ideal, but in practice 
> it seems to contain the filename and parent folders within the archive. 
> Although the aim was to make this work for tar files, the unit test 
> demonstrates that it has a similar effect for chidren of ODT files (this was 
> an accident as the test archive that was already in the codebase happened to 
> have ODT files).
> I could not find descriptions of exactly what EMBEDDED_RELATIONSHIP_ID and 
> EMBEDDED_RESOURCE_PATH are meant to represent, so I'm not sure whether this 
> approach is heading in the right direction, though it seems to work for my 
> usecase. A future enhancement could be for the recursive parser to include a 
> new field with the complete path to each file (again, imagining nested 
> archives it might look something like 
> "archive.tar.gz/archive.tar/folder1/nested.zip/folder2/file.txt").
> Looking forward to everybody's thoughts, and happy to make refinements to the 
> proof of concept attached as needed.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to