Tim Allison created TIKA-4219:
---------------------------------

             Summary: Figure out what to do with epubs with encrypted non-core 
content
                 Key: TIKA-4219
                 URL: https://issues.apache.org/jira/browse/TIKA-4219
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


While working on TIKA-4218, we noticed several epubs that are now identified as 
encrypted, which is good. The encryption detection was added in TIKA-4176.

On the other hand, we found several epubs that are now identified as encrypted 
but from which we were extracting content before we added the encryption 
detection.

In at least one file that I reviewed, the issue is that only non-core content 
(the fonts) is encrypted. So, for text+metadata extraction, we could still get 
all the content and then throw an EncryptedDocumentException, or maybe just set 
a flag indicating that something is encrypted.
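
To make the distinction concrete, here is a rough Java sketch of how one might 
tell "only fonts are encrypted" apart from "real content is encrypted". It 
assumes the epub lists its encrypted resources in META-INF/encryption.xml as 
CipherReference elements, and it guesses at fonts purely by file extension. 
The class and method names are hypothetical; this is not what Tika currently 
does.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only; not part of Tika.
class EpubEncryptionSketch {

    // Namespace of the XML Encryption elements used in META-INF/encryption.xml.
    static final String XMLENC_NS = "http://www.w3.org/2001/04/xmlenc#";

    // Assumption: a resource counts as a font if it has one of these extensions.
    static final String[] FONT_EXTENSIONS = {".otf", ".ttf", ".woff", ".woff2"};

    // Returns the URI of every CipherReference found in an encryption.xml stream.
    static List<String> encryptedResources(InputStream encryptionXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations, since the input is untrusted.
            factory.setFeature(
                    "http://apache.org/xml/features/disallow-doctype-decl", true);
            NodeList refs = factory.newDocumentBuilder()
                    .parse(encryptionXml)
                    .getElementsByTagNameNS(XMLENC_NS, "CipherReference");
            List<String> uris = new ArrayList<>();
            for (int i = 0; i < refs.getLength(); i++) {
                uris.add(((Element) refs.item(i)).getAttribute("URI"));
            }
            return uris;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse encryption.xml", e);
        }
    }

    // True only if at least one resource is encrypted and all of them look like fonts.
    static boolean onlyFontsEncrypted(List<String> uris) {
        if (uris.isEmpty()) {
            return false;
        }
        for (String uri : uris) {
            String lower = uri.toLowerCase(Locale.ROOT);
            boolean isFont = false;
            for (String ext : FONT_EXTENSIONS) {
                if (lower.endsWith(ext)) {
                    isFont = true;
                    break;
                }
            }
            if (!isFont) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up encryption.xml with a single obfuscated font entry.
        String xml = "<encryption"
                + " xmlns=\"urn:oasis:names:tc:opendocument:xmlns:container\""
                + " xmlns:enc=\"http://www.w3.org/2001/04/xmlenc#\">"
                + "<enc:EncryptedData><enc:CipherData>"
                + "<enc:CipherReference URI=\"OEBPS/fonts/body.otf\"/>"
                + "</enc:CipherData></enc:EncryptedData></encryption>";
        List<String> uris = encryptedResources(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(uris + " fonts only: " + onlyFontsEncrypted(uris));
    }
}
```

With a check along these lines, the parser could keep extracting text when 
onlyFontsEncrypted is true and reserve the exception for files where the 
encrypted resources include actual content. Checking the EncryptionMethod 
algorithm instead of the file extension might be more reliable, but I have not 
verified that against the corpus.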

I'm not sure what the best thing to do is in this case.

An example file is here: 
http://corpora.tika.apache.org/base/docs/commoncrawl3/47/47WOSBEUHE6CRMVDFBOOHUD36FEQAZ6T



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
