[ https://issues.apache.org/jira/browse/TIKA-4732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081340#comment-18081340 ]
ASF GitHub Bot commented on TIKA-4732:
--------------------------------------
Copilot commented on code in PR #2820:
URL: https://github.com/apache/tika/pull/2820#discussion_r3251948384
##########
tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/server/PipesWorker.java:
##########
@@ -569,22 +486,21 @@ protected ParseDataOrPipesResult parseFromTuple() throws TikaException, Interrup
}
// Use newMetadata() to apply any configured write limits
Metadata metadata = localContext.newMetadata();
+ // Carry the caller-supplied resource name across the fresh-metadata boundary so
+ // detection, suffix selection, and the Frictionless manifest's name field see
+ // the logical filename rather than whatever the fetcher's path happens to be
+ // (e.g., a server-side spool prefix). TikaInputStream.get(path, metadata)
+ // already honors a pre-set RESOURCE_NAME_KEY.
+ String suppliedName = fetchEmitTuple.getMetadata().get(TikaCoreProperties.RESOURCE_NAME_KEY);
Review Comment:
`fetchEmitTuple.getMetadata()` can be null (e.g.,
`PipesIterator.COMPLETED_SEMAPHORE` is constructed with null metadata), so
`fetchEmitTuple.getMetadata().get(TikaCoreProperties.RESOURCE_NAME_KEY)` can
throw an NPE. Please null-check the tuple metadata before reading from it (or
treat a null Metadata as empty) when copying `RESOURCE_NAME_KEY` into the new
`metadata` instance.
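
A minimal sketch of the null-safe copy requested above, not the actual patch. To stay self-contained it uses a stand-in `SimpleMetadata` class instead of Tika's `Metadata`; the helper name `copyResourceName` is hypothetical, and the literal `"resourceName"` is assumed to match `TikaCoreProperties.RESOURCE_NAME_KEY`.

```java
import java.util.HashMap;
import java.util.Map;

final class ResourceNameCopy {

    // Assumed to mirror TikaCoreProperties.RESOURCE_NAME_KEY.
    static final String RESOURCE_NAME_KEY = "resourceName";

    // Stand-in for org.apache.tika.metadata.Metadata; only get/set are modeled.
    static final class SimpleMetadata {
        private final Map<String, String> values = new HashMap<>();

        String get(String key) {
            return values.get(key);
        }

        void set(String key, String value) {
            values.put(key, value);
        }
    }

    // Hypothetical helper: copies the supplied resource name into the fresh
    // metadata, treating a null source as empty instead of dereferencing it.
    static void copyResourceName(SimpleMetadata source, SimpleMetadata target) {
        if (source == null) {
            return; // e.g., a tuple constructed with null metadata
        }
        String suppliedName = source.get(RESOURCE_NAME_KEY);
        if (suppliedName != null && !suppliedName.isEmpty()) {
            target.set(RESOURCE_NAME_KEY, suppliedName);
        }
    }

    public static void main(String[] args) {
        SimpleMetadata target = new SimpleMetadata();
        copyResourceName(null, target); // no NPE; target stays empty

        SimpleMetadata source = new SimpleMetadata();
        source.set(RESOURCE_NAME_KEY, "report.docx");
        copyResourceName(source, target);
        System.out.println(target.get(RESOURCE_NAME_KEY)); // prints report.docx
    }
}
```

In `PipesWorker` the same guard would presumably wrap the `fetchEmitTuple.getMetadata()` call before `suppliedName` is read, so a tuple without metadata simply leaves the fresh `metadata` untouched.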
> Duplicate source file included in unpack/all server responses
> -------------------------------------------------------------
>
> Key: TIKA-4732
> URL: https://issues.apache.org/jira/browse/TIKA-4732
> Project: Tika
> Issue Type: Improvement
> Components: tika-server
> Reporter: Lawrence Moorehead
> Priority: Minor
>
> I was looking into the tika-4.0.0-alpha1 release unpack/all endpoint and
> noticed that the REGULAR and FRICTIONLESS formats both insert two copies of
> the original source file into the zip response. One version is named in the
> normal way for the format and one is named after the temp file created in the
> PipesWorker. I assume this wasn't intentional since the ParseHandler also
> includes the file based on the {{includeOriginal}} config.
> I put an example output below; hopefully that clarifies what I'm seeing.
> REGULAR unpack/all on a .docx with an embedded image.
> ||Path||Note||
> |0.docx| |
> |0.docx.metadata.json| |
> |1.jpg| |
> |1.jpg.metadata.json| |
> |tika-unpack-6334398074681213642.docx|Duplicate of 0.docx|
> FRICTIONLESS unpack/all
> ||Path||Note||
> |datapackage.json|Three file references|
> |metadata.json|Two metadata entries.|
> |unpacked/0.docx| |
> |unpacked/1.jpg| |
> |tika-unpack-16513904232837934973.docx|Duplicate|
> Here are my changes to update the PipesWorker so we don't get two copies of
> the original file: https://github.com/elemdisc/tika/pull/2
> Additionally I noticed that the temp filename is used in the
> RESOURCE_NAME_KEY metadata for the source file in the response. I made an
> update (in a separate commit) to reuse the supplied filename in the metadata.
> I think that makes more sense for most people and doesn't cause any issues
> with anything else in Tika, but it doesn't actually impact my usage so I'm
> fine leaving this alone too.
> Let me know if there's something I'm missing in terms of why it was set up
> this way or issues with the changes I'm suggesting. Thanks!
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)