Tim Allison created TIKA-4219:
---------------------------------
Summary: Figure out what to do with epubs with encrypted non-core
content
Key: TIKA-4219
URL: https://issues.apache.org/jira/browse/TIKA-4219
Project: Tika
Issue Type: Task
Reporter: Tim Allison
On TIKA-4218, we noticed several epubs that are now being identified as
encrypted, which is good. That detection was added in TIKA-4176.
On the other hand, we found several epubs that are now identified as encrypted
but from which we extracted content before we added the encryption detection.
The issue in at least one file that I reviewed is that only non-core content is
encrypted -- the fonts. So, for text+metadata extraction, we could still get
all the content and then throw an EncryptedDocumentException or maybe just flag
the file as encrypted.
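
To make the distinction concrete, here is a rough sketch (Python, not Tika
code) of how one might tell "only fonts are encrypted" apart from "actual
content is encrypted". It assumes the encrypted resources are listed as
CipherReference URIs in META-INF/encryption.xml, as the EPUB container format
describes. The function names, the return labels, and the list of font
extensions are all hypothetical and chosen only for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML Encryption namespace used by the elements inside encryption.xml.
XMLENC_NS = "http://www.w3.org/2001/04/xmlenc#"

# Assumption: these extensions are treated as "non-core" font resources.
FONT_EXTENSIONS = (".otf", ".ttf", ".woff", ".woff2")


def encrypted_resources(epub_bytes):
    """Return the URIs listed in META-INF/encryption.xml, or None if absent."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        if "META-INF/encryption.xml" not in zf.namelist():
            return None
        root = ET.fromstring(zf.read("META-INF/encryption.xml"))
    tag = "{%s}CipherReference" % XMLENC_NS
    return [ref.get("URI", "") for ref in root.iter(tag)]


def classify(epub_bytes):
    """Hypothetical three-way classification of an epub's encryption state."""
    uris = encrypted_resources(epub_bytes)
    if uris is None:
        return "not_encrypted"
    # Only call it "fonts_only" when every listed resource looks like a font.
    if uris and all(u.lower().endswith(FONT_EXTENSIONS) for u in uris):
        return "fonts_only"
    # Be conservative: anything else (including an empty list) counts as
    # encrypted content.
    return "content_encrypted"
```

With something like this, a "fonts_only" result could lead to normal
text+metadata extraction plus a flag, while "content_encrypted" would still
throw. Whether that is the right behavior is exactly the open question.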
I'm not sure what the best thing to do is in this case.
An example file is here:
http://corpora.tika.apache.org/base/docs/commoncrawl3/47/47WOSBEUHE6CRMVDFBOOHUD36FEQAZ6T
--
This message was sent by Atlassian Jira
(v8.20.10#820010)