Lawrence Moorehead created TIKA-4732:
----------------------------------------

             Summary: Duplicate source file included in unpack/all server responses
                 Key: TIKA-4732
                 URL: https://issues.apache.org/jira/browse/TIKA-4732
             Project: Tika
          Issue Type: Improvement
          Components: tika-server
            Reporter: Lawrence Moorehead


I was looking into the unpack/all endpoint in the tika-4.0.0-alpha1 release and 
noticed that the REGULAR and FRICTIONLESS formats both insert two copies of the 
original source file into the zip response. One copy is named in the normal 
way for the format, and the other is named after the temp file created in the 
PipesWorker. I assume this wasn't intentional, since the ParseHandler also 
includes the file based on the {{includeOriginal}} config.

I've put example output below; hopefully that clarifies what I'm seeing.

REGULAR unpack/all on a .docx with an embedded image.
||Path||Note||
|0.docx| |
|0.docx.metadata.json| |
|1.jpg| |
|1.jpg.metadata.json| |
|tika-unpack-6334398074681213642.docx|Duplicate of 0.docx|


FRICTIONLESS unpack/all
||Path||Note||
|datapackage.json|Three file references|
|metadata.json|Two metadata entries|
|unpacked/0.docx| |
|unpacked/1.jpg| |
|tika-unpack-16513904232837934973.docx|Duplicate of unpacked/0.docx|
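
For anyone who wants to check a response for this, here is a minimal sketch (not part of Tika) that groups zip entries with byte-for-byte identical contents. The function names {{find_duplicate_entries}} and {{build_sample_zip}} are hypothetical, and the sample archive only imitates the REGULAR listing above with placeholder bytes; a real check would read the zip returned by unpack/all instead.

```python
import hashlib
import io
import zipfile
from collections import defaultdict
from typing import List


def find_duplicate_entries(zip_bytes: bytes) -> List[List[str]]:
    """Return groups of entry names whose contents are identical."""
    by_digest = defaultdict(list)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Hash the uncompressed bytes so the entry name does not matter.
            digest = hashlib.sha256(zf.read(info)).hexdigest()
            by_digest[digest].append(info.filename)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


def build_sample_zip() -> bytes:
    """Build an in-memory zip that imitates the REGULAR listing (placeholder data)."""
    source = b"placeholder bytes standing in for the .docx"
    entries = {
        "0.docx": source,
        "0.docx.metadata.json": b'{"entry": 0}',
        "1.jpg": b"placeholder bytes standing in for the image",
        "1.jpg.metadata.json": b'{"entry": 1}',
        # Same bytes as 0.docx, stored under the temp-file name.
        "tika-unpack-6334398074681213642.docx": source,
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return buf.getvalue()


if __name__ == "__main__":
    for group in find_duplicate_entries(build_sample_zip()):
        print(" == ".join(group))
```

Comparing content hashes rather than names is deliberate here, since the duplicate shows up under a different, temp-file-based name.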



Here are my changes to update the PipesWorker so we don't get two copies of the 
original file: https://github.com/elemdisc/tika/pull/2

Additionally, I noticed that the temp filename is used in the RESOURCE_NAME_KEY 
metadata for the source file in the response. I made an update (in a separate 
commit) to reuse the supplied filename in the metadata. I think that makes more 
sense for most people and doesn't cause issues elsewhere in Tika, but it 
doesn't affect my usage, so I'm fine leaving this alone too.

Let me know if there's something I'm missing about why it was set up this 
way, or if there are issues with the changes I'm suggesting. Thanks!

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
