[ 
https://issues.apache.org/jira/browse/TIKA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803357#comment-17803357
 ] 

Peter Wyatt commented on TIKA-4175:
-----------------------------------

Without investigating the specific PDF...

* Don't forget that Adobe originally defined Collections in PDF 1.7, and this is 
NOT entirely the same as the vendor-neutral Collections defined by ISO in PDF 
2.0, not least because the Adobe version used FLASH/SWF for navigators (i.e. 
the UX). So check the Adobe PDF 1.7 reference (NOT ISO 32000-1:2008, as I'm not 
sure what changed between the Adobe document and the fast-tracked ISO 
32000-1:2008 version of PDF 1.7) as well as the "Adobe® Supplement to the 
ISO 32000 BaseVersion: 1.7 ExtensionLevel: 3" spec - 
[https://pdfa.org/resource/pdf-specification-index/]

 

* *EmbeddedFiles* is simply a name tree of file-spec dictionaries that point to 
streams; by themselves they provide no context or semantics as to how the 
embedded file is used or referenced in the PDF. You probably need to find where 
the file-spec is referenced in the PDF DOM to know how, and for what, it is 
being used. Collections need to use EmbeddedFiles, but other features that rely 
on embedded file streams do not (though they can if they wish)...

 

* Crypt filters can also apply just to embedded files (see AuthEvent EFOpen): 
"{*}AuthEvent{*} can also be {_}EFOpen{_}, which indicates the presence of an 
embedded file that is encrypted with a crypt filter that may be different from 
the crypt filters used by default to encrypt strings and streams in the 
document."
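
To illustrate the last point, here is a minimal sketch of how a parser might spot crypt filters that apply only to embedded files. It assumes the /CF dictionary has already been read into a plain map of filter name to key/value strings (without the leading slash); the class and method names are hypothetical and are not part of Tika or PDFBox.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not a Tika or PDFBox API. */
final class CryptFilterSketch {

    private CryptFilterSketch() {}

    /**
     * Returns the names of the crypt filters whose AuthEvent is EFOpen.
     * Assumption: cf models the /CF dictionary as
     * filter name -> (key -> value), all plain strings.
     */
    static List<String> embeddedFileCryptFilters(Map<String, Map<String, String>> cf) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : cf.entrySet()) {
            // A missing AuthEvent is simply not matched here.
            if ("EFOpen".equals(entry.getValue().get("AuthEvent"))) {
                names.add(entry.getKey());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up example data, not taken from any real document.
        Map<String, Map<String, String>> cf = new LinkedHashMap<>();
        cf.put("StdCF", Map.of("CFM", "AESV2", "AuthEvent", "DocOpen"));
        cf.put("EmbeddedCF", Map.of("CFM", "AESV2", "AuthEvent", "EFOpen"));
        System.out.println(embeddedFileCryptFilters(cf));
    }
}
```

A non-empty result would only be a hint that an encrypted embedded payload may be present; whether that should lead to an EncryptedDocumentException is a separate decision.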

> Additional IRM-protected PDFs should throw EncryptedDocumentException
> ---------------------------------------------------------------------
>
>                 Key: TIKA-4175
>                 URL: https://issues.apache.org/jira/browse/TIKA-4175
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Ross Johnson
>            Priority: Major
>         Attachments: image-2023-12-20-17-06-29-791.png, 
> image-2023-12-20-17-12-09-946.png
>
>
> I've come across some PDFs that use an Adobe IRM scheme, similar to 
> TIKA-4082, where a wrapper PDF contains an IRM-protected embedded PDF. These 
> wrapper PDFs do not currently throw because the structure is a bit different 
> than what is currently being looked for in PDFParser#checkEncryptedPayload().
> As best I can tell, this form of IRM was implemented by Adobe, but is 
> licensed to 3rd parties who then can market it as their own form of PDF 
> protection. The documents I've seen are from an IRM product from Interlinks, 
> but there are likely very similarly protected PDFs from other products.
> Opening the wrapper PDF in Adobe Reader / Acrobat prompts for a server 
> authentication (shown below). Opening in other viewers shows the wrapper 
> splash page, which indicates that the viewer is not secure and to use Adobe 
> Reader. !image-2023-12-20-17-06-29-791.png!
> The wrapper PDFs I've seen use PDF version 1.4 and have a somewhat generic 
> /EmbeddedFiles dictionary:  !image-2023-12-20-17-12-09-946.png!
> The encrypted PDF payloads I've seen have a somewhat interesting /Encrypt 
> dictionary with a Filter value of "Adobe.APS".
> {code:java}
> <<
>   /EDCData (...base64 string...)
>   /CF <<
>     /DefaultCryptFilter<</CFM/AESV3/Length 256>>
>   >>
>   /PDRLLic (...base64 string...)
>   /R 65537
>   /StmF /DefaultCryptFilter
>   /Filter /Adobe.APS
>   /EncryptMetadata true
>   /V 5
>   /StrF /DefaultCryptFilter
>   /PDRLPol (...base64 string...)
>   /SubFilter /adobe.pdrl.v0
> >>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
