Tim Allison created TIKA-4617:
---------------------------------
Summary: Fix embedded stream translator to avoid changing file name
Key: TIKA-4617
URL: https://issues.apache.org/jira/browse/TIKA-4617
Project: Tika
Issue Type: Task
Reporter: Tim Allison
In 3.x, while reviewing the results in preparation for the next release, I
noticed that embedded OLE file names differed between the last release and the
current branch_3x.
The cause of this difference is that we're now applying the stream translators
to embedded OLE objects when we run digesting. I had copied and pasted the
stream translation code from tika-app, or maybe tika-server's /unpack, and had
not thought clearly enough about the second-order consequences.
During the digesting phase, the stream translators are modifying the embedded
file name.
We should fix this so that digesting doesn't modify the metadata at all, aside
from adding digests.
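
One way to picture the intended behavior is the rough sketch below. It is not
Tika's actual code: the class, the method names, and the "resourceName" and
"digest:" keys are all hypothetical, and a plain Map stands in for the real
metadata object. The idea is to run the digest step against a scratch copy of
the metadata and copy back only the digest entries, so that any other change
(such as a rewritten file name) is discarded.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

public class DigestIsolationSketch {

    // Hypothetical key names, chosen for illustration only.
    static final String NAME_KEY = "resourceName";
    static final String DIGEST_PREFIX = "digest:";

    /**
     * Runs the digest step on a scratch copy of the metadata, then copies
     * back only entries whose key starts with DIGEST_PREFIX. Everything else
     * the step changed (e.g. the file name) is dropped.
     */
    static void digestWithoutSideEffects(byte[] data,
                                         Map<String, String> metadata,
                                         BiConsumer<byte[], Map<String, String>> digestStep) {
        Map<String, String> scratch = new HashMap<>(metadata);
        digestStep.accept(data, scratch);
        for (Map.Entry<String, String> e : scratch.entrySet()) {
            if (e.getKey().startsWith(DIGEST_PREFIX)) {
                metadata.put(e.getKey(), e.getValue());
            }
        }
    }

    /** Stand-in for a digest step that also rewrites the name as a side effect. */
    static void leakyDigestStep(byte[] data, Map<String, String> metadata) {
        metadata.put(NAME_KEY, "translated.bin"); // the unwanted side effect
        metadata.put(DIGEST_PREFIX + "SHA256", sha256Hex(data));
    }

    /** Hex-encoded SHA-256 of the given bytes. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(NAME_KEY, "embedded.ole");
        digestWithoutSideEffects("hello".getBytes(StandardCharsets.UTF_8),
                metadata, DigestIsolationSketch::leakyDigestStep);
        System.out.println(metadata.get(NAME_KEY)); // prints "embedded.ole"
        System.out.println(metadata.get(DIGEST_PREFIX + "SHA256"));
    }
}
```

Copying back by key prefix is only one option. The actual fix might instead
avoid running the stream translators during digesting at all, or digest the
translated bytes while leaving the original metadata untouched.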
In looking at some of the other diffs, I think this causes quite a few second-
and third-order problems. Once we fix this, I _think_ we'll have addressed most
of the issues in the 3.x diffs.