Lawrence Moorehead created TIKA-4732:
----------------------------------------
Summary: Duplicate source file included in unpack/all server
responses
Key: TIKA-4732
URL: https://issues.apache.org/jira/browse/TIKA-4732
Project: Tika
Issue Type: Improvement
Components: tika-server
Reporter: Lawrence Moorehead
While testing the unpack/all endpoint in the tika-4.0.0-alpha1 release, I
noticed that the REGULAR and FRICTIONLESS formats both put two copies of the
original source file into the zip response. One copy is named in the normal
way for the format, and the other is named after the temp file created in the
PipesWorker. I assume this wasn't intentional, since the ParseHandler also
includes the file based on the {{includeOriginal}} config.
I've put example output below; hopefully that clarifies what I'm seeing.
REGULAR unpack/all on a .docx with an embedded image.
||Path||Note||
|0.docx| |
|0.docx.metadata.json| |
|1.jpg| |
|1.jpg.metadata.json| |
|tika-unpack-6334398074681213642.docx|Duplicate of 0.docx|
FRICTIONLESS unpack/all
||Path||Note||
|datapackage.json|Three file references|
|metadata.json|Two metadata entries|
|unpacked/0.docx| |
|unpacked/1.jpg| |
|tika-unpack-16513904232837934973.docx|Duplicate|
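For anyone who wants to check a response for this, here is a minimal sketch that groups the entries of a zip by the SHA-256 of their content and reports any group with more than one name. It uses only the JDK and does not call Tika. The class name {{UnpackDuplicateCheck}}, the helper {{sampleZip}}, and the placeholder file contents are hypothetical; the entry names are copied from the REGULAR table above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class UnpackDuplicateCheck {

    /** Returns digest -> entry names, keeping only digests shared by two or more entries. */
    public static Map<String, List<String>> findDuplicates(byte[] zipBytes) {
        Map<String, List<String>> byDigest = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            byte[] buf = new byte[8192];
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Hash the content of the current entry only.
                MessageDigest sha = MessageDigest.getInstance("SHA-256");
                int n;
                while ((n = zin.read(buf)) != -1) {
                    sha.update(buf, 0, n);
                }
                String digest = Base64.getEncoder().encodeToString(sha.digest());
                byDigest.computeIfAbsent(digest, k -> new ArrayList<>()).add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        // Drop digests that belong to a single entry.
        byDigest.values().removeIf(names -> names.size() < 2);
        return byDigest;
    }

    /** Hypothetical stand-in for a REGULAR response; contents are placeholders, not real files. */
    public static byte[] sampleZip() {
        String[][] entries = {
            {"0.docx", "docx-bytes"},
            {"0.docx.metadata.json", "{\"name\":\"0.docx\"}"},
            {"1.jpg", "jpg-bytes"},
            {"1.jpg.metadata.json", "{\"name\":\"1.jpg\"}"},
            {"tika-unpack-6334398074681213642.docx", "docx-bytes"},
        };
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(out)) {
            for (String[] e : entries) {
                zout.putNextEntry(new ZipEntry(e[0]));
                zout.write(e[1].getBytes(StandardCharsets.UTF_8));
                zout.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        findDuplicates(sampleZip()).values()
            .forEach(names -> System.out.println("Same content: " + names));
    }
}
```

To run it against a real response, read the saved zip into a byte array and pass that to {{findDuplicates}} instead of {{sampleZip()}}.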
Here are my changes to update the PipesWorker so we don't get two copies of the
original file: https://github.com/elemdisc/tika/pull/2
Additionally, I noticed that the temp filename is used as the RESOURCE_NAME_KEY
metadata value for the source file in the response. I made an update (in a
separate commit) to reuse the supplied filename in the metadata instead. I think
that makes more sense for most people and doesn't cause any issues elsewhere in
Tika, but it doesn't actually affect my usage, so I'm fine leaving this alone too.
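To illustrate the naming choice described here, a rough sketch follows. It is not the code from the linked pull request and does not use Tika classes: {{ResourceNameChoice}} and {{chooseResourceName}} are hypothetical names, a plain {{Map}} stands in for Tika's metadata object, and the key string is assumed to match the value of {{TikaCoreProperties.RESOURCE_NAME_KEY}}. How the filename reaches the server is not covered.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class ResourceNameChoice {

    // Assumption: mirrors the value of Tika's TikaCoreProperties.RESOURCE_NAME_KEY.
    public static final String RESOURCE_NAME_KEY = "resourceName";

    /** Prefers the base name of the supplied filename; falls back to the temp file's name. */
    public static String chooseResourceName(String suppliedName, Path tempFile) {
        if (suppliedName != null) {
            // Keep only the last path segment so a client-supplied directory is not recorded.
            String normalized = suppliedName.trim().replace('\\', '/');
            String base = normalized.substring(normalized.lastIndexOf('/') + 1);
            if (!base.isEmpty()) {
                return base;
            }
        }
        return tempFile.getFileName().toString();
    }

    public static void main(String[] args) {
        Path temp = Paths.get("tika-unpack-6334398074681213642.docx");
        Map<String, String> metadata = new HashMap<>();
        metadata.put(RESOURCE_NAME_KEY, chooseResourceName("report.docx", temp));
        System.out.println(metadata);
    }
}
```

Falling back to the temp file name keeps the current behavior whenever no usable filename is supplied.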
Let me know if I'm missing a reason why it was set up this way, or if there are
issues with the changes I'm suggesting. Thanks!