[
https://issues.apache.org/jira/browse/TIKA-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18053700#comment-18053700
]
Tim Allison edited comment on TIKA-4630 at 1/22/26 9:02 PM:
------------------------------------------------------------
Thank you for opening this and for figuring out this deep part of our codebase.
We need to improve our documentation; there is definitely room for
improvement.
A couple of things are going on.
1) We don't have tgz detection (yet), so the parent document is the gz file,
and inside that is the tar file.
2) When we generate the embedded resource path, we aren't including the "path"
as reported within the tar file, and perhaps we should? The resource name within
the tar file is "test-documents/testWORD.doc", but we only use testWORD.doc
when we generate /embedded-1/testWORD.doc.
3) For your use case, you want to use X-TIKA:embedded_id_path and
X-TIKA:embedded_id. Those are numeric identifiers for the internal location of
the embedded files. I used those because the embedded resource paths are
user-generated data and therefore dangerous.
So, the tar "file" has an embedded_id of 1, and the xls file has an embedded_id
of 2.
These are the entries for the Excel file:
{noformat}
1: X-TIKA:embedded_id : 2
1: X-TIKA:embedded_id_path : /1/2
1: X-TIKA:embedded_resource_path : /embedded-1/testEXCEL.xls
1: X-TIKA:final_embedded_resource_path : /embedded-1/testEXCEL.xls
{noformat}
Given that there's no name for the tar file inside the gz file, Tika defaults
to "embedded-1".
You should be able to reconstruct whatever names you like from the
embedded_id_path and the embedded_ids.
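To illustrate that reconstruction, here is a minimal sketch rather than anything from the Tika codebase. It assumes each parsed document's metadata has been copied into a plain Map<String, String>, that the entry name lives under the key "resourceName", and that the xls entry reports "test-documents/testEXCEL.xls" as its name (a hypothetical value, by analogy with testWORD.doc). The class and method names are made up for the example, and the "embedded-<id>" placeholder for unnamed entries is only a stand-in.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EmbeddedPathSketch {
    static final String ID = "X-TIKA:embedded_id";
    static final String ID_PATH = "X-TIKA:embedded_id_path";
    // Assumption: the key under which the entry name is stored.
    static final String NAME = "resourceName";

    /** Maps each embedded_id to a display name, with a placeholder for unnamed entries. */
    static Map<String, String> namesById(List<Map<String, String>> metadataList) {
        Map<String, String> names = new HashMap<>();
        for (Map<String, String> m : metadataList) {
            String id = m.get(ID);
            if (id != null) { // the outermost container has no embedded_id
                names.put(id, m.getOrDefault(NAME, "embedded-" + id));
            }
        }
        return names;
    }

    /** Turns an id path such as "/1/2" into a name path by looking up each id. */
    static String namePath(String idPath, Map<String, String> names) {
        List<String> parts = new ArrayList<>();
        for (String id : idPath.split("/")) {
            if (!id.isEmpty()) {
                parts.add(names.getOrDefault(id, "embedded-" + id));
            }
        }
        return String.join("/", parts);
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = List.of(
                Map.of(ID, "1", ID_PATH, "/1"), // the unnamed tar inside the gz
                Map.of(ID, "2", ID_PATH, "/1/2", NAME, "test-documents/testEXCEL.xls"));
        Map<String, String> names = namesById(docs);
        // prints embedded-1/test-documents/testEXCEL.xls
        System.out.println(namePath(docs.get(1).get(ID_PATH), names));
    }
}
```

Because the lookup is driven by the numeric ids, the caller decides how much to trust the names themselves, which are user-generated.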
The EMBEDDED_RELATIONSHIP_ID is a Microsoft-based internal embedded file
identifier that we sometimes use as a fallback if we can't get a resource name
(as you did).
Given this information, what do you think?
In your example above, maybe we do some hackery in our compressor input stream
to get the embedded file name from the parent file name, minus the .gz, so
that we could get "archive.tar/folder1/nested.zip/folder2/file.txt"?
Secondly, perhaps we trust path-based information from the container file so
that we use "test-documents/testWORD.doc".
I _think_ we do that with zip? Why aren't we doing it with tar?
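As a rough sketch of those two ideas (not Tika's implementation, and with hypothetical class and method names), the tar's name could be derived by stripping the compression suffix from the parent name, and entry paths from the container could be kept while dropping "." and ".." segments, since those paths are user-generated:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class GzNameSketch {
    /** Derives the name of the compressed payload from the parent file name. */
    static String uncompressedName(String parentName) {
        String lower = parentName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".tgz") && parentName.length() > 4) {
            // "archive.tgz" is shorthand for "archive.tar.gz"
            return parentName.substring(0, parentName.length() - 4) + ".tar";
        }
        if (lower.endsWith(".gz") && parentName.length() > 3) {
            return parentName.substring(0, parentName.length() - 3);
        }
        return parentName; // no recognised suffix: leave the name alone
    }

    /** Joins names into one path, skipping empty, "." and ".." segments. */
    static String joinPath(String... names) {
        List<String> parts = new ArrayList<>();
        for (String name : names) {
            for (String segment : name.split("/")) {
                if (segment.isEmpty() || segment.equals(".") || segment.equals("..")) {
                    continue;
                }
                parts.add(segment);
            }
        }
        return String.join("/", parts);
    }

    public static void main(String[] args) {
        String tarName = uncompressedName("archive.tar.gz");
        // prints archive.tar/folder1/nested.zip/folder2/file.txt
        System.out.println(joinPath(tarName, "folder1/nested.zip", "folder2/file.txt"));
    }
}
```

Dropping ".." is only one possible policy; a stricter variant might reject such entries outright.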
> embeddedRelationshipId is missing from tar files that are children of gzip
> files (i.e. tarballs)
> ------------------------------------------------------------------------------------------------
>
> Key: TIKA-4630
> URL: https://issues.apache.org/jira/browse/TIKA-4630
> Project: Tika
> Issue Type: Improvement
> Components: core
> Affects Versions: 3.2.3
> Reporter: Iachimoe
> Priority: Major
> Attachments: tika-embedded-reln-jira.PATCH
>
>
> For some context, my use case requires me to be able to precisely locate
> individual files within nested archives, using only metadata from tika.
> I have made this work for zip files using a combination of
> EMBEDDED_RELATIONSHIP_ID and EMBEDDED_RESOURCE_PATH. EMBEDDED_RESOURCE_PATH
> on its own does not work because it omits information about which directory a
> file is stored in within an archive. When trying to make it work for tgz
> files, the approach breaks because, while the parser quite reasonably
> considers "archive.tar" to be a child of "archive.tar.gz", the corresponding
> entry metadata lacks an EMBEDDED_RELATIONSHIP_ID. This makes it difficult to
> construct a path of the form
> "archive.tar.gz/archive.tar/my_folder/my_file.txt".
> I attach a rough solution (patch is against branch_3x), with unit test, that
> appears to work for my use case. The approach is for the
> RecursiveParserWrapper to set EMBEDDED_RELATIONSHIP_ID once parsing has
> completed if it has not been set during the parsing process. It sets it to
> the value of RESOURCE_NAME_KEY, which is probably not ideal, but in practice
> it seems to contain the filename and parent folders within the archive.
> Although the aim was to make this work for tar files, the unit test
> demonstrates that it has a similar effect for children of ODT files (this was
> an accident as the test archive that was already in the codebase happened to
> have ODT files).
> I could not find descriptions of exactly what EMBEDDED_RELATIONSHIP_ID and
> EMBEDDED_RESOURCE_PATH are meant to represent, so I'm not sure whether this
> approach is heading in the right direction, though it seems to work for my
> use case. A future enhancement could be for the recursive parser to include a
> new field with the complete path to each file (again, imagining nested
> archives it might look something like
> "archive.tar.gz/archive.tar/folder1/nested.zip/folder2/file.txt").
> Looking forward to everybody's thoughts, and happy to make refinements to the
> proof of concept attached as needed.
>