Iachimoe created TIKA-4705:
------------------------------

             Summary: resourceName of tar file in nested tarball should not 
contain tarball's parent directories
                 Key: TIKA-4705
                 URL: https://issues.apache.org/jira/browse/TIKA-4705
             Project: Tika
          Issue Type: Improvement
            Reporter: Iachimoe


Example structure:

test-nested-tarball.tar contains:

 folderContainingTgz/inner/nested.tgz

The resource name for nested.tgz would be 
`folderContainingTgz/inner/nested.tgz`, which is consistent with the general 
behaviour for nested archives (e.g. zips).

However, if nested.tgz does not contain metadata specifying the name of the 
file nested within it, then that file will have a resourceName of 
`folderContainingTgz/inner/nested.tar`. This is inconsistent with how other 
nested archives behave, because parent folders are generally only included 
if they relate to the immediate parent archive. The parent archive of 
nested.tgz in this example is test-nested-tarball.tar, which is why it makes 
sense for the folders to be included there. However, the parent archive of 
nested.tar is nested.tgz, and there is no folder called folderContainingTgz 
within nested.tgz. The expected resourceName for the inner file is therefore 
`nested.tar`.
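
To illustrate the expected naming, here is a minimal sketch in plain Java. It 
is not Tika's implementation and uses no Tika API: the class 
NestedResourceNameSketch and its methods baseName and fallbackName are 
hypothetical names introduced only for this example. It assumes the fallback 
name is derived from the compressed file's own resource name by a simple 
suffix mapping, and shows the parent directories being dropped first.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika: illustrates the naming proposed above.
final class NestedResourceNameSketch {

    // Keep only the last path segment, dropping any parent directories.
    static String baseName(String path) {
        int slash = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        return slash < 0 ? path : path.substring(slash + 1);
    }

    // Derive a fallback name for the content of a compressed file that
    // carries no metadata naming the file inside it.
    // Assumption: ".tgz" maps to ".tar" and a trailing ".gz" is removed.
    static String fallbackName(String compressedResourceName) {
        String name = baseName(compressedResourceName);
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".tgz")) {
            return name.substring(0, name.length() - 4) + ".tar";
        }
        if (lower.endsWith(".gz")) {
            return name.substring(0, name.length() - 3);
        }
        return name;
    }

    public static void main(String[] args) {
        // prints "nested.tar", not "folderContainingTgz/inner/nested.tar"
        System.out.println(fallbackName("folderContainingTgz/inner/nested.tgz"));
    }
}
```

The directories belong to the entry path inside test-nested-tarball.tar, so 
they are stripped before the name of the decompressed content is derived.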

A draft pull request will follow, with a unit test that should make the 
issue clear and a proposed fix.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
