[ 
https://issues.apache.org/jira/browse/TIKA-4732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081336#comment-18081336
 ] 

Tim Allison commented on TIKA-4732:
-----------------------------------

Yikes. Great catch, [~lawrencem]. Thank you! Let me know what you think of the 
PR; it is largely yours.

> Duplicate source file included in unpack/all server responses
> -------------------------------------------------------------
>
>                 Key: TIKA-4732
>                 URL: https://issues.apache.org/jira/browse/TIKA-4732
>             Project: Tika
>          Issue Type: Improvement
>          Components: tika-server
>            Reporter: Lawrence Moorehead
>            Priority: Minor
>
> I was looking into the tika-4.0.0-alpha1 release unpack/all endpoint and 
> noticed that the REGULAR and FRICTIONLESS formats both insert two copies of 
> the original source file into the zip response. One version is named in the 
> normal way for the format and one is named after the temp file created in the 
> PipesWorker. I assume this wasn't intentional since the ParseHandler also 
> includes the file based on the {{includeOriginal}} config.
> I put an example output below; hopefully that clarifies what I'm seeing.
> REGULAR unpack/all on a .docx with an embedded image.
> ||Path||Note||
> |0.docx| |
> |0.docx.metadata.json| |
> |1.jpg| |
> |1.jpg.metadata.json| |
> |tika-unpack-6334398074681213642.docx|Duplicate of 0.docx|
> FRICTIONLESS unpack/all
> ||Path||Note||
> |datapackage.json|Three file references|
> |metadata.json|Two metadata entries.|
> |unpacked/0.docx| |
> |unpacked/1.jpg| |
> |tika-unpack-16513904232837934973.docx|Duplicate|
> Here are my changes to update the PipesWorker so we don't get two copies of 
> the original file: https://github.com/elemdisc/tika/pull/2
> Additionally, I noticed that the temp filename is used in the 
> RESOURCE_NAME_KEY metadata for the source file in the response. I made an 
> update (in a separate commit) to reuse the supplied filename in the metadata. 
> I think that makes more sense for most people and doesn't cause any issues 
> with anything else in Tika, but it doesn't actually impact my usage, so I'm 
> fine leaving this alone too.
> Let me know if there's something I'm missing in terms of why it was set up 
> this way or issues with the changes I'm suggesting. Thanks!
>  
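
For anyone who wants to check a response for the stray entry described in the
issue, here is a minimal sketch that scans a zip for entries whose base name
starts with "tika-unpack-", the prefix seen in the example tables. It uses only
java.util.zip. The class name UnpackZipCheck and the helpers findTempEntries
and zipOf are hypothetical names for illustration and are not part of Tika.
The in-memory zip built in main stands in for a real unpack/all response.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class UnpackZipCheck {

    // Assumption: temp-file entries use the prefix seen in the issue's tables.
    static final String TEMP_PREFIX = "tika-unpack-";

    /** Returns the names of zip entries whose base name starts with TEMP_PREFIX. */
    static List<String> findTempEntries(InputStream zipBytes) throws IOException {
        List<String> suspicious = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(zipBytes)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                String name = entry.getName();
                // Compare on the base name so entries inside folders are caught too.
                String base = name.substring(name.lastIndexOf('/') + 1);
                if (base.startsWith(TEMP_PREFIX)) {
                    suspicious.add(name);
                }
            }
        }
        return suspicious;
    }

    /** Builds an in-memory zip of empty entries; a stand-in for a server response. */
    static byte[] zipOf(String... names) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            for (String n : names) {
                zos.putNextEntry(new ZipEntry(n));
                zos.closeEntry();
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Entry names copied from the REGULAR example in the issue description.
        byte[] zip = zipOf(
                "0.docx",
                "0.docx.metadata.json",
                "1.jpg",
                "1.jpg.metadata.json",
                "tika-unpack-6334398074681213642.docx");
        System.out.println(findTempEntries(new ByteArrayInputStream(zip)));
    }
}
```

With a fixed server, the returned list would be expected to be empty for both
the REGULAR and FRICTIONLESS formats.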



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
