[ 
https://issues.apache.org/jira/browse/TIKA-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18070925#comment-18070925
 ] 

ASF GitHub Bot commented on TIKA-4705:
--------------------------------------

tballison commented on code in PR #2730:
URL: https://github.com/apache/tika/pull/2730#discussion_r3034510123


##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java:
##########
@@ -174,6 +174,30 @@ public void testTarball() throws Exception {
                 "/test-documents.tar"), actualEmbeddedPaths);
     }
 
+    @Test
+    public void testNestedTarball() throws Exception {
+        List<Metadata> list = getRecursiveMetadata("test-nested-tarball.tar");
+        List<String> actualInternalPaths =
+            list.stream()
+                .map(m -> m.get(TikaCoreProperties.RESOURCE_NAME_KEY))
+                .collect(Collectors.toList());
+
+        List<String> expectedInternalPaths = Arrays.asList("test-nested-tarball.tar",
+            "folderWithinTgz/testTXT.txt",
+            "nested.tar",
+            "folderContainingTgz/inner/nested.tgz");
+        assertEquals(expectedInternalPaths, actualInternalPaths);
+
+        List<String> actualEmbeddedPaths =
+            list.stream()
+                .map(m -> m.get(TikaCoreProperties.EMBEDDED_RESOURCE_PATH))
+                .collect(Collectors.toList());
+        assertEquals(Arrays.asList(null,
+            "/nested.tgz/nested.tar/testTXT.txt",
+            "/nested.tgz/nested.tar",
+            "/nested.tgz"), actualEmbeddedPaths);
+    }

Review Comment:
   Add a test for `internalPaths`?
   
   There are three things we care about with this change: the resource name 
(which should be just the file name), the embedded resource path, and the 
internal path.
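
To illustrate the "just the file name" expectation for the file decompressed from a `.tgz`, here is a minimal sketch. It is not Tika's code and not the fix proposed in the PR: `ResourceNames`, `fileNameOnly`, and `innerName` are hypothetical names, and the sketch assumes `/` as the path separator and handles only the `.tgz` and `.gz` suffixes.

```java
// Hypothetical helper, for illustration only; not part of Tika.
class ResourceNames {

    // Drop any parent directories, keeping only the last path segment.
    static String fileNameOnly(String path) {
        int slash = path.lastIndexOf('/');
        return slash < 0 ? path : path.substring(slash + 1);
    }

    // Guess the name of the file inside a gzip-compressed entry when the
    // gzip header carries no name: strip the directories first, then map
    // the suffix (.tgz -> .tar, .gz -> no suffix).
    static String innerName(String compressedPath) {
        String name = fileNameOnly(compressedPath);
        if (name.endsWith(".tgz")) {
            return name.substring(0, name.length() - ".tgz".length()) + ".tar";
        }
        if (name.endsWith(".gz")) {
            return name.substring(0, name.length() - ".gz".length());
        }
        return name;
    }

    public static void main(String[] args) {
        // The entry path from the issue's example structure.
        System.out.println(innerName("folderContainingTgz/inner/nested.tgz"));
    }
}
```

With the example from the issue, the derived name is `nested.tar` rather than `folderContainingTgz/inner/nested.tar`, because the folders belong to the outer tarball and not to `nested.tgz`.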

> resourceName of tar file in nested tarball should not contain tarball's 
> parent directories
> ------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4705
>                 URL: https://issues.apache.org/jira/browse/TIKA-4705
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Iachimoe
>            Priority: Major
>
> Example structure:
> test-nested-tarball.tar contains:
>  folderContainingTgz/inner/nested.tgz
>  
> The resource name for nested.tgz would be
> `folderContainingTgz/inner/nested.tgz`, which is consistent with the general
> behaviour for nested archives (e.g. zips).
> However, if nested.tgz does not contain metadata specifying the name of the
> nested file within, then that file will have a resourceName of
> `folderContainingTgz/inner/nested.tar`. This is inconsistent with how other
> nested archives behave, because parent folders are generally only
> included if they relate to the immediate parent archive. The parent archive
> of nested.tgz in this example is test-nested-tarball.tar, and that is why it
> makes sense for the folders to be included. However, the parent archive of
> nested.tar is nested.tgz, and there is no folder called folderContainingTgz
> within nested.tgz.
>  
> Draft pull request with a unit test that hopefully makes the issue clear, and 
> a proposed fix at https://github.com/apache/tika/pull/2730/changes
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
