[ https://issues.apache.org/jira/browse/TIKA-4732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081336#comment-18081336 ]
Tim Allison commented on TIKA-4732:
-----------------------------------
Yikes. Great catch, [~lawrencem]. Thank you! Let me know what you think of the
PR... it is largely yours.
> Duplicate source file included in unpack/all server responses
> -------------------------------------------------------------
>
> Key: TIKA-4732
> URL: https://issues.apache.org/jira/browse/TIKA-4732
> Project: Tika
> Issue Type: Improvement
> Components: tika-server
> Reporter: Lawrence Moorehead
> Priority: Minor
>
> I was looking into the tika-4.0.0-alpha1 release unpack/all endpoint and
> noticed that the REGULAR and FRICTIONLESS formats both insert two copies of
> the original source file into the zip response. One version is named in the
> normal way for the format and one is named after the temp file created in the
> PipesWorker. I assume this wasn't intentional since the ParseHandler also
> includes the file based on the {{includeOriginal}} config.
> I put example output below; hopefully that clarifies what I'm seeing.
> REGULAR unpack/all on a .docx with an embedded image.
> ||Path||Note||
> |0.docx| |
> |0.docx.metadata.json| |
> |1.jpg| |
> |1.jpg.metadata.json| |
> |tika-unpack-6334398074681213642.docx|Duplicate of 0.docx|
> FRICTIONLESS unpack/all
> ||Path||Note||
> |datapackage.json|Three file references|
> |metadata.json|Two metadata entries.|
> |unpacked/0.docx| |
> |unpacked/1.jpg| |
> |tika-unpack-16513904232837934973.docx|Duplicate|
> Here are my changes to update the PipesWorker so we don't get two copies of
> the original file: https://github.com/elemdisc/tika/pull/2
> Additionally, I noticed that the temp filename is used in the
> RESOURCE_NAME_KEY metadata for the source file in the response. I made an
> update (in a separate commit) to reuse the supplied filename in the metadata.
> I think that makes more sense for most people and doesn't cause any issues
> with anything else in Tika, but it doesn't actually impact my usage so I'm
> fine leaving this alone too.
> Let me know if there's something I'm missing in terms of why it was set up
> this way or issues with the changes I'm suggesting. Thanks!
>
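For anyone who wants to check a saved unpack/all response for the duplicate described above, here is a minimal sketch. It is not part of Tika or of the PR. It only assumes the response is a plain zip and groups entries by content hash; `find_duplicate_entries` and `build_sample_zip` are hypothetical helper names, and the sample archive uses placeholder bytes that mimic the REGULAR listing in the report.

```python
import hashlib
import io
import zipfile
from collections import defaultdict


def find_duplicate_entries(zip_bytes):
    """Return groups of entry names whose contents are byte-identical."""
    by_digest = defaultdict(list)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directories carry no content to compare
            digest = hashlib.sha256(zf.read(info)).hexdigest()
            by_digest[digest].append(info.filename)
    # Keep only digests shared by more than one entry.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


def build_sample_zip():
    """Build an in-memory zip laid out like the REGULAR example (placeholder bytes)."""
    source = b"placeholder bytes standing in for the original .docx"
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("0.docx", source)
        zf.writestr("0.docx.metadata.json", b'{"entry": 0}')
        zf.writestr("1.jpg", b"placeholder bytes standing in for the image")
        zf.writestr("1.jpg.metadata.json", b'{"entry": 1}')
        # The extra temp-file copy reported in the issue.
        zf.writestr("tika-unpack-6334398074681213642.docx", source)
    return buf.getvalue()


if __name__ == "__main__":
    for group in find_duplicate_entries(build_sample_zip()):
        print("identical content:", ", ".join(group))
```

To check a real response, pass the bytes of the saved zip to `find_duplicate_entries` instead of the sample archive.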
--
This message was sent by Atlassian Jira
(v8.20.10#820010)