[ https://issues.apache.org/jira/browse/TIKA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803357#comment-17803357 ]
Peter Wyatt commented on TIKA-4175:
-----------------------------------

Without investigating the specific PDF...

* Don't forget that Adobe originally defined Collections in PDF 1.7, and this is NOT entirely the same as the vendor-neutral Collections defined by ISO in PDF 2.0, not least because the Adobe version was defined using FLASH/SWF for navigators (i.e. the UX). So check the Adobe PDF 1.7 reference (NOT ISO 32000-1:2008, as I'm not sure what changed between the Adobe version and the fast-tracked ISO 32000-1:2008 version of PDF 1.7) as well as the "Adobe® Supplement to the ISO 32000 BaseVersion: 1.7 ExtensionLevel: 3" spec - [https://pdfa.org/resource/pdf-specification-index/]
* *EmbeddedFiles* is simply a list of file-spec dictionaries that point to streams; by themselves they provide no context or semantics as to how the embedded file is used/referenced in the PDF. You probably need to find the referenced file-spec in the PDF DOM to know how/what it is being used for. Collections need to use EmbeddedFiles, but other features that rely on embedded file streams do not (but can if they wish)...
* Crypt filters can also apply just to embedded files (see AuthEvent EFOpen): "*AuthEvent* can also be _EFOpen_, which indicates the presence of an embedded file that is encrypted with a crypt filter that may be different from the crypt filters used by default to encrypt strings and streams in the document."

> Additional IRM-protected PDFs should throw EncryptedDocumentException
> ---------------------------------------------------------------------
>
>                 Key: TIKA-4175
>                 URL: https://issues.apache.org/jira/browse/TIKA-4175
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Ross Johnson
>            Priority: Major
>         Attachments: image-2023-12-20-17-06-29-791.png, image-2023-12-20-17-12-09-946.png
>
>
> I've come across some PDFs that use an Adobe IRM scheme, similar to TIKA-4082, where a wrapper PDF contains an IRM-protected embedded PDF. These wrapper PDFs do not currently throw because the structure is a bit different from what is currently being looked for in PDFParser#checkEncryptedPayload().
>
> As best I can tell, this form of IRM was implemented by Adobe, but is licensed to 3rd parties who can then market it as their own form of PDF protection. The documents I've seen are from an IRM product from Interlinks, but there are likely very similarly protected PDFs from other products.
>
> Opening the wrapper PDF in Adobe Reader / Acrobat prompts for server authentication (shown below). Opening it in other viewers shows the wrapper splash page, which indicates that the viewer is not secure and says to use Adobe Reader.
>
> !image-2023-12-20-17-06-29-791.png!
>
> The wrapper PDFs I've seen use PDF version 1.4 and have a somewhat generic /EmbeddedFiles dictionary:
>
> !image-2023-12-20-17-12-09-946.png!
>
> The encrypted PDF payloads I've seen have a somewhat interesting /Encrypt dictionary with a Filter value of "Adobe.APS".
> {code:java}
> <<
>   /EDCData (...base64 string...)
>   /CF <<
>     /DefaultCryptFilter<</CFM/AESV3/Length 256>>
>   >>
>   /PDRLLic (...base64 string...)
>   /R 65537
>   /StmF /DefaultCryptFilter
>   /Filter /Adobe.APS
>   /EncryptMetadata true
>   /V 5
>   /StrF /DefaultCryptFilter
>   /PDRLPol (...base64 string...)
>   /SubFilter /adobe.pdrl.v0
> >>
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
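
As a rough illustration of the /Encrypt dictionary quoted in the issue, the Java sketch below flags a dictionary whose /Filter name is Adobe.APS. It only looks at the dictionary's text form and uses nothing beyond the JDK. The class name `ApsEncryptSketch` and the method `looksLikeAdobeAps` are hypothetical, and this is not how PDFParser#checkEncryptedPayload() is implemented; a real check would presumably inspect the parsed dictionary of the embedded payload rather than raw text.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Tika or PDFBox.
final class ApsEncryptSketch {

    // Matches the key /Filter followed by the name /Adobe.APS.
    // The '/' must sit directly before "Filter", so the /SubFilter key
    // does not match, and the lookahead rejects longer names
    // such as /Adobe.APSX.
    private static final Pattern APS_FILTER =
            Pattern.compile("/Filter\\s*/Adobe\\.APS(?![A-Za-z0-9._-])");

    // Returns true if the given dictionary text names Adobe.APS as its filter.
    static boolean looksLikeAdobeAps(String encryptDictText) {
        return encryptDictText != null
                && APS_FILTER.matcher(encryptDictText).find();
    }

    public static void main(String[] args) {
        // Abbreviated form of the dictionary quoted in the issue.
        String dict = "<< /R 65537 /StmF /DefaultCryptFilter "
                + "/Filter /Adobe.APS /V 5 "
                + "/SubFilter /adobe.pdrl.v0 >>";
        System.out.println(looksLikeAdobeAps(dict));
    }
}
```

A text match like this can be fooled by PDF comments or string contents that happen to contain the same bytes, so it is only meant to show which key/value pair distinguishes these payloads.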