[jira] Commented: (TIKA-267) encrypted pdf files aren't handled properly

Nick Burch (JIRA) Wed, 08 Dec 2010 18:06:26 -0800

    [ https://issues.apache.org/jira/browse/TIKA-267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969608#action_12969608 ]


Nick Burch commented on TIKA-267:
---------------------------------

The problem with the canExtractContent check is that you will often end up 
with garbage metadata (e.g. TIKA-389).

I think we want to always decrypt if we can, and that certainly fixes things 
for the test documents I have. However, if you have a document that this 
change breaks, could you please upload it? (At the moment we don't have a 
unit test for your use case, so I can't be sure what effect this change will 
have.)
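
A minimal sketch of the "always decrypt if we can" approach, for illustration 
only. It assumes a hypothetical PdfLike interface standing in for the two 
PDDocument methods mentioned in this issue (isEncrypted and decrypt), and a 
toy ToyDoc class so the logic can be exercised without PDFBox. The real 
decrypt method may throw other exception types as well; here everything is 
collapsed into IOException.

```java
import java.io.IOException;

class DecryptSketch {

    /** Hypothetical stand-in for the PDDocument methods used in this issue. */
    interface PdfLike {
        boolean isEncrypted();
        void decrypt(String password) throws IOException;
    }

    /**
     * Always try the empty password on an encrypted document, without
     * consulting the permission flags first.
     * Returns true if the document is readable afterwards.
     */
    static boolean decryptIfPossible(PdfLike doc) {
        if (!doc.isEncrypted()) {
            return true; // nothing to decrypt
        }
        try {
            doc.decrypt("");
            return true;
        } catch (IOException e) {
            // A real password is needed; the caller decides how to report it.
            return false;
        }
    }

    /** Toy document (assumption): decrypts only with the expected password. */
    static final class ToyDoc implements PdfLike {
        private final String password; // null means "not encrypted"
        private boolean decrypted;

        ToyDoc(String password) {
            this.password = password;
        }

        public boolean isEncrypted() {
            return password != null && !decrypted;
        }

        public void decrypt(String candidate) throws IOException {
            if (!password.equals(candidate)) {
                throw new IOException("wrong password");
            }
            decrypted = true;
        }
    }
}
```

With this sketch, decryptIfPossible succeeds for an unencrypted document and 
for one protected with an empty password, and reports failure for one that 
needs a real password.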

> encrypted pdf files aren't handled properly
> -------------------------------------------
>
>                 Key: TIKA-267
>                 URL: https://issues.apache.org/jira/browse/TIKA-267
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.4
>         Environment: Ubuntu Linux 8.10, JRE 1.5
>            Reporter: Sascha Szott
>            Assignee: Jukka Zitting
>            Priority: Critical
>             Fix For: 0.5
>
>   Original Estimate: 0.08h
>  Remaining Estimate: 0.08h
>
> While extracting full text from a batch of PDF documents, I noticed odd 
> behaviour in Tika when processing encrypted documents (that is, documents 
> that restrict specific actions such as editing or printing). Extracting 
> content from an encrypted PDF document does not always require decrypting 
> it. For instance, when creating an encrypted PDF document, the author can 
> choose to allow content extraction without a password. Unfortunately, 
> Tika's PDF parser isn't aware of this at the moment. I therefore suggest a 
> minor change to the parse method of class 
> org.apache.tika.parser.pdf.PDFParser: introduce an additional check ("is 
> copying allowed") before trying to decrypt the document.
> To be more precise, here is a code snippet:
> public void parse(...) throws ... {
>   PDDocument pdfDocument = PDDocument.load(stream);
>   try {
>     //decrypt document only if copying is not allowed
>     if (!pdfDocument.getCurrentAccessPermission().canExtractContent()) {
>       if (pdfDocument.isEncrypted()) {
>         try {
>           pdfDocument.decrypt("");
>         } catch (Exception e) {
>           // Ignore
>         }
>       }
>     }
>     ...
> Another solution to this problem would be to eliminate the "isEncrypted" 
> check since PDFBox seems to handle the extraction of content out of encrypted 
> documents correctly (and throws an IOException in case of failure).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
