[ 
https://issues.apache.org/jira/browse/TIKA-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830493#comment-17830493
 ] 

Tim Allison commented on TIKA-4219:
-----------------------------------

This fix tries to extract all content.

In "regular" non-streaming handling, if a content file is encrypted, this 
throws an EncryptedDocumentException immediately. If a non-content resource is 
encrypted, this throws an EncryptedDocumentException after extracting all the 
content.

In streaming mode, this throws an EncryptedDocumentException for anything that 
is encrypted.
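
A minimal sketch of the control flow described above, not the actual parser
code. The class, the Resource model and the local EncryptedException are all
hypothetical stand-ins so that the example compiles on its own, and it assumes
that streaming mode throws at the point where the encrypted entry is
encountered, which the description above does not spell out.

```java
import java.util.List;

class DeferredEncryptionSketch {

    /** Local stand-in for an encrypted-document exception (hypothetical). */
    static class EncryptedException extends Exception {
        EncryptedException(String message) {
            super(message);
        }
    }

    /** Hypothetical model of one resource inside the epub. */
    static class Resource {
        final String name;
        final boolean content;   // true for text content, false for fonts etc.
        final boolean encrypted;
        final String text;

        Resource(String name, boolean content, boolean encrypted, String text) {
            this.name = name;
            this.content = content;
            this.encrypted = encrypted;
            this.text = text;
        }
    }

    /**
     * Appends all readable content to out. Encrypted content (or anything
     * encrypted in streaming mode) fails immediately; an encrypted
     * non-content resource is remembered and reported after extraction.
     */
    static void extract(List<Resource> resources, boolean streaming, StringBuilder out)
            throws EncryptedException {
        String deferred = null;
        for (Resource r : resources) {
            if (r.encrypted) {
                if (streaming || r.content) {
                    throw new EncryptedException("encrypted: " + r.name);
                }
                if (deferred == null) {
                    deferred = r.name; // remember it, keep extracting
                }
            } else if (r.content) {
                out.append(r.text);
            }
        }
        if (deferred != null) {
            throw new EncryptedException("encrypted non-content resource: " + deferred);
        }
    }
}
```

In this sketch a caller that catches the exception in non-streaming mode still
finds the extracted text in out when only a font was encrypted.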

The triggering file also showed that we should strip namespace prefixes out of 
qnames in our handlers. It is possible for xml: prefixes to creep into 
attribute or element qnames.
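
One reading of the point above is that a handler should compare names without
any namespace prefix. A minimal sketch of that idea, using a hypothetical
helper (QNameSketch.localPart is not taken from the Tika code base):

```java
class QNameSketch {

    /** Returns the part of a qname after the first ':', or the qname itself if there is none. */
    static String localPart(String qName) {
        int colon = qName.indexOf(':');
        return colon < 0 ? qName : qName.substring(colon + 1);
    }
}
```

With this helper, both "lang" and "xml:lang" would be compared as "lang".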

Oddly, plain tika-app extracted all the content from this file in earlier 
versions (before the encryption "fix"), apparently because the handler created 
in plain tika-app is not namespace aware (?), whereas the ToTextHandler is (?). 
So we got the full content out of tika-app, but not out of tika-app -J.

This is now also fixed.
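
As background on the namespace-awareness difference mentioned above, the
following sketch shows how the names reported to a SAX handler depend on
whether the parser is namespace aware. It uses only the JDK's SAX API and does
not reproduce the tika-app handlers; the class and method names are
hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class NamespaceAwarenessSketch {

    /** Parses xml and records "localName|qName" for every start tag. */
    static List<String> elementNames(String xml, boolean namespaceAware) {
        List<String> seen = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes atts) {
                            seen.add(localName + "|" + qName);
                        }
                    });
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalStateException(e);
        }
        return seen;
    }
}
```

A namespace-aware parser reports the local name separately from the prefixed
qname, so a handler that matches on qName alone can miss a prefixed element
that a handler matching on localName would catch.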

> Figure out what to do with epubs with encrypted non-core content
> ----------------------------------------------------------------
>
>                 Key: TIKA-4219
>                 URL: https://issues.apache.org/jira/browse/TIKA-4219
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> On TIKA-4218, we noticed several epubs that were now being identified as 
> encrypted, which is good. We did this work on TIKA-4176.
> On the other hand, we found several epubs that were now identified as 
> encrypted but which had content before we were doing the encryption detection.
> The issue in at least one file that I reviewed is that non-core content is 
> encrypted -- the fonts. So, from a text+metadata extraction, we could still 
> get all the content and then throw an Encrypted Exception or maybe flag 
> something as encrypted.
> I'm not sure what the best thing to do is in this case.
> An example file is here: 
> http://corpora.tika.apache.org/base/docs/commoncrawl3/47/47WOSBEUHE6CRMVDFBOOHUD36FEQAZ6T



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
