[jira] [Commented] (TIKA-4732) Duplicate source file included in unpack/all server responses

ASF GitHub Bot (Jira) Fri, 15 May 2026 20:41:23 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081340#comment-18081340
 ]


ASF GitHub Bot commented on TIKA-4732:
--------------------------------------

Copilot commented on code in PR #2820:
URL: https://github.com/apache/tika/pull/2820#discussion_r3251948384


##########
tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/server/PipesWorker.java:
##########
@@ -569,22 +486,21 @@ protected ParseDataOrPipesResult parseFromTuple() throws TikaException, Interrup
         }
         // Use newMetadata() to apply any configured write limits
         Metadata metadata = localContext.newMetadata();
+        // Carry the caller-supplied resource name across the fresh-metadata boundary so
+        // detection, suffix selection, and the Frictionless manifest's name field see
+        // the logical filename rather than whatever the fetcher's path happens to be
+        // (e.g., a server-side spool prefix). TikaInputStream.get(path, metadata)
+        // already honors a pre-set RESOURCE_NAME_KEY.
+        String suppliedName = fetchEmitTuple.getMetadata().get(TikaCoreProperties.RESOURCE_NAME_KEY);

Review Comment:
   `fetchEmitTuple.getMetadata()` can be null (e.g., 
`PipesIterator.COMPLETED_SEMAPHORE` is constructed with null metadata), so 
`fetchEmitTuple.getMetadata().get(TikaCoreProperties.RESOURCE_NAME_KEY)` can 
throw an NPE. Please null-check the tuple metadata before reading from it (or 
treat a null Metadata as empty) when copying `RESOURCE_NAME_KEY` into the new 
`metadata` instance.
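
   A minimal sketch of the null-safe copy suggested above, written against 
   plain-Java stand-ins so it runs without Tika on the classpath. The class 
   `ResourceNameCopy`, the helper `copyResourceName`, and the use of 
   `Map<String, String>` in place of `Metadata` are all hypothetical; the 
   constant merely stands in for `TikaCoreProperties.RESOURCE_NAME_KEY`. It 
   illustrates the guard only and is not the change made in the PR.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: a Map stands in for org.apache.tika.metadata.Metadata.
class ResourceNameCopy {

    // Stand-in for TikaCoreProperties.RESOURCE_NAME_KEY (value assumed for illustration).
    static final String RESOURCE_NAME_KEY = "resourceName";

    // Copies the caller-supplied resource name into the fresh metadata,
    // treating a null source metadata as empty instead of dereferencing it.
    static void copyResourceName(Map<String, String> tupleMetadata,
                                 Map<String, String> target) {
        if (tupleMetadata == null) {
            return; // e.g. a sentinel tuple constructed with null metadata
        }
        String suppliedName = tupleMetadata.get(RESOURCE_NAME_KEY);
        if (suppliedName != null && !suppliedName.trim().isEmpty()) {
            target.put(RESOURCE_NAME_KEY, suppliedName);
        }
    }

    public static void main(String[] args) {
        Map<String, String> fresh = new HashMap<>();

        // Null tuple metadata: no exception, nothing copied.
        copyResourceName(null, fresh);

        // Normal case: the logical filename is carried over.
        Map<String, String> supplied = new HashMap<>();
        supplied.put(RESOURCE_NAME_KEY, "report.docx");
        copyResourceName(supplied, fresh);

        System.out.println(fresh.get(RESOURCE_NAME_KEY));
    }
}
```

   In the worker itself the same shape would presumably be a null check on 
   `fetchEmitTuple.getMetadata()` before the `get(...)` call, leaving the 
   fresh `metadata` untouched when no name was supplied.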
   





> Duplicate source file included in unpack/all server responses
> -------------------------------------------------------------
>
>                 Key: TIKA-4732
>                 URL: https://issues.apache.org/jira/browse/TIKA-4732
>             Project: Tika
>          Issue Type: Improvement
>          Components: tika-server
>            Reporter: Lawrence Moorehead
>            Priority: Minor
>
> I was looking into the tika-4.0.0-alpha1 release unpack/all endpoint and 
> noticed that the REGULAR and FRICTIONLESS formats both insert two copies of 
> the original source file into the zip response. One version is named in the 
> normal way for the format and one is named after the temp file created in the 
> PipesWorker. I assume this wasn't intentional since the ParseHandler also 
> includes the file based on the {{includeOriginal}} config.
> I put an example output below; hopefully that clarifies what I'm seeing.
> REGULAR unpack/all on a .docx with an embedded image.
> ||Path||Note||
> |0.docx| |
> |0.docx.metadata.json| |
> |1.jpg| |
> |1.jpg.metadata.json| |
> |tika-unpack-6334398074681213642.docx|Duplicate of 0.docx|
> FRICTIONLESS unpack/all
> ||Path||Note||
> |datapackage.json|Three file references|
> |metadata.json|Two metadata entries.|
> |unpacked/0.docx| |
> |unpacked/1.jpg| |
> |tika-unpack-16513904232837934973.docx|Duplicate|
> Here are my changes to update the PipesWorker so we don't get two copies of 
> the original file: https://github.com/elemdisc/tika/pull/2
> Additionally I noticed that the temp filename is used in the 
> RESOURCE_NAME_KEY metadata for the source file in the response. I made an 
> update (in a separate commit) to reuse the supplied filename in the metadata. 
> I think that makes more sense for most people and doesn't cause any issues 
> with anything else in Tika, but it doesn't actually impact my usage so I'm 
> fine leaving this alone too.
> Let me know if there's something I'm missing in terms of why it was set up 
> this way or issues with the changes I'm suggesting. Thanks!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
