[
https://issues.apache.org/jira/browse/TIKA-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18053944#comment-18053944
]
Tim Allison edited comment on TIKA-4630 at 1/23/26 2:53 PM:
------------------------------------------------------------
Note that the embedded resource path for the Thumbnail does not include
"Thumbnails/". I want to make a distinction between the embedded_resource_path,
which is a synthetic thing that Tika creates for the structure of embedded
files, and the "internal_path", which is what the container file says the
internal path is.
If you want
{{/test-documents.tar/testOpenOffice2.odt/Thumbnails/thumbnail.png}}, I think
you should create that on your own by using the embedded ids, the embedded id
paths and the internal paths. I worry that a single combined path conflates
embedded object structure with internal paths within package-type files. I
could maybe see something like
{{test-documents.tar->testOpenOffice2.odt/Thumbnails->thumbnail.png}}, but that
doesn't sit well with me for the same reason.
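As a rough illustration of building such a path on the client side, here is a minimal Python sketch that works on plain dicts standing in for per-file metadata records. The key names ({{embedded_id}}, {{embedded_id_path}}, {{internal_path}}, {{resource_name}}), the "/1/2" form of the id path, and the sample values are all assumptions made for this example, not confirmed Tika output; check the real keys and formats in your own metadata before relying on it.

```python
def combined_path(record, records_by_id, container_name):
    """Join the container name with one segment per ancestor in the id path.

    Assumes record["embedded_id_path"] looks like "/1/2", where each number
    is the embedded id of an ancestor (the last one being the record itself).
    """
    ids = [int(part) for part in record["embedded_id_path"].split("/") if part]
    segments = [container_name]
    for embedded_id in ids:
        ancestor = records_by_id[embedded_id]
        # Prefer the path the container reports; fall back to the bare name.
        name = ancestor.get("internal_path") or ancestor["resource_name"]
        segments.append(name.strip("/"))
    return "/" + "/".join(segments)


# Hypothetical records for the example discussed above.
sample = {
    1: {"embedded_id_path": "/1",
        "resource_name": "testOpenOffice2.odt",
        "internal_path": "testOpenOffice2.odt"},
    2: {"embedded_id_path": "/1/2",
        "resource_name": "thumbnail.png",
        "internal_path": "Thumbnails/thumbnail.png"},
}

print(combined_path(sample[2], sample, "test-documents.tar"))
# prints /test-documents.tar/testOpenOffice2.odt/Thumbnails/thumbnail.png
```

Keeping this composition in client code leaves the embedded-object structure and the container's internal paths as separate fields, so each consumer can choose how, or whether, to merge them.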
When I get a clean build, I'll push a TIKA-4630 branch so you can look at the
PR.
> embeddedRelationshipId is missing from tar files that are children of gzip
> files (i.e. tarballs)
> ------------------------------------------------------------------------------------------------
>
> Key: TIKA-4630
> URL: https://issues.apache.org/jira/browse/TIKA-4630
> Project: Tika
> Issue Type: Improvement
> Components: core
> Affects Versions: 3.2.3
> Reporter: Iachimoe
> Assignee: Tim Allison
> Priority: Major
> Attachments: tika-embedded-reln-jira.PATCH
>
>
> For some context, my use case requires me to precisely locate
> individual files within nested archives, using only metadata from Tika.
> I have made this work for zip files using a combination of
> EMBEDDED_RELATIONSHIP_ID and EMBEDDED_RESOURCE_PATH. EMBEDDED_RESOURCE_PATH
> on its own does not work because it omits information about which directory a
> file is stored in within an archive. When trying to make it work for tgz
> files, the approach breaks because, while the parser quite reasonably
> considers "archive.tar" to be a child of "archive.tar.gz", the corresponding
> entry metadata lacks an EMBEDDED_RELATIONSHIP_ID. This makes it difficult to
> construct a path of the form
> "archive.tar.gz/archive.tar/my_folder/my_file.txt".
> I attach a rough solution (patch is against branch_3x), with unit test, that
> appears to work for my use case. The approach is for the
> RecursiveParserWrapper to set EMBEDDED_RELATIONSHIP_ID once parsing has
> completed if it has not been set during the parsing process. It sets it to
> the value of RESOURCE_NAME_KEY, which is probably not ideal, but in practice
> it seems to contain the filename and parent folders within the archive.
> Although the aim was to make this work for tar files, the unit test
> demonstrates that it has a similar effect for children of ODT files (this was
> an accident as the test archive that was already in the codebase happened to
> have ODT files).
> I could not find descriptions of exactly what EMBEDDED_RELATIONSHIP_ID and
> EMBEDDED_RESOURCE_PATH are meant to represent, so I'm not sure whether this
> approach is heading in the right direction, though it seems to work for my
> use case. A future enhancement could be for the recursive parser to include a
> new field with the complete path to each file (again, imagining nested
> archives, it might look something like
> "archive.tar.gz/archive.tar/folder1/nested.zip/folder2/file.txt").
> Looking forward to everybody's thoughts, and happy to make refinements to the
> proof of concept attached as needed.
>
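The fallback described in the quoted issue (fill in the relationship id from the resource name when parsing left it unset) can be approximated as a post-processing step over the metadata list. This is a minimal Python sketch of that idea and not the attached patch, which changes RecursiveParserWrapper itself. It assumes each record is a dict, that the first record is the container, and that the keys are named {{embeddedRelationshipId}} and {{resourceName}}; the function name is hypothetical.

```python
def fill_missing_relationship_ids(records):
    """Return copies of the records, with embeddedRelationshipId defaulted
    to resourceName for embedded entries that lack one.

    Assumes records[0] is the container file, which is left untouched.
    """
    filled = []
    for index, record in enumerate(records):
        updated = dict(record)  # copy, so the caller's data is not mutated
        is_embedded = index > 0
        if (is_embedded
                and not updated.get("embeddedRelationshipId")
                and updated.get("resourceName")):
            updated["embeddedRelationshipId"] = updated["resourceName"]
        filled.append(updated)
    return filled
```

As the issue itself notes, the resource name is probably not an ideal stand-in, so a fallback like this is best treated as a stopgap.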