[
https://issues.apache.org/jira/browse/TIKA-267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969608#action_12969608
]
Nick Burch commented on TIKA-267:
---------------------------------
The issue with the canExtractContent check is that you will often end up with
garbage metadata (e.g. TIKA-389).
I think we want to always decrypt if we can, and that certainly fixes things
for the test documents I have. However, if you have a document that this then
breaks, could you please upload it? (At the moment we don't have a unit test
for your use case, so I can't be sure what effect this change will have.)
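To make the difference between the two policies concrete, here is a rough
sketch. Note that PdfDoc and FakeDoc are made-up stand-ins for the handful of
PDDocument calls mentioned in this issue (they are not the PDFBox API), and the
method names are purely illustrative. The case where the two policies differ is
an encrypted document that permits extraction: the check-first variant never
attempts decryption there.

```java
final class DecryptSketch {

    /** Hypothetical stand-in for the PDDocument calls discussed here; NOT the PDFBox API. */
    interface PdfDoc {
        boolean isEncrypted();
        boolean canExtractContent();
        void decrypt(String password) throws Exception;
    }

    /** Proposal from the issue: attempt decryption only when extraction is not permitted. */
    static boolean decryptOnlyIfExtractionDenied(PdfDoc doc) {
        if (!doc.canExtractContent() && doc.isEncrypted()) {
            return tryEmptyPassword(doc);
        }
        return false;
    }

    /** Policy from this comment: attempt decryption whenever the document is encrypted. */
    static boolean decryptWheneverEncrypted(PdfDoc doc) {
        return doc.isEncrypted() && tryEmptyPassword(doc);
    }

    /** Tries the empty password; returns true only if decryption succeeded. */
    private static boolean tryEmptyPassword(PdfDoc doc) {
        try {
            doc.decrypt("");
            return true;
        } catch (Exception e) {
            return false; // same "ignore" behaviour as the snippet in the issue
        }
    }

    /** In-memory fake, used only to exercise the two policies. */
    static final class FakeDoc implements PdfDoc {
        final boolean encrypted;
        final boolean extractionAllowed;
        final boolean emptyPasswordWorks;
        boolean decrypted = false;

        FakeDoc(boolean encrypted, boolean extractionAllowed, boolean emptyPasswordWorks) {
            this.encrypted = encrypted;
            this.extractionAllowed = extractionAllowed;
            this.emptyPasswordWorks = emptyPasswordWorks;
        }

        public boolean isEncrypted() { return encrypted; }
        public boolean canExtractContent() { return extractionAllowed; }

        public void decrypt(String password) throws Exception {
            if (!emptyPasswordWorks || !password.isEmpty()) {
                throw new Exception("wrong password");
            }
            decrypted = true;
        }
    }

    public static void main(String[] args) {
        // Encrypted document that allows extraction: the policies disagree here.
        FakeDoc a = new FakeDoc(true, true, true);
        FakeDoc b = new FakeDoc(true, true, true);
        System.out.println("check-first decrypted:    " + decryptOnlyIfExtractionDenied(a));
        System.out.println("always-decrypt decrypted: " + decryptWheneverEncrypted(b));
    }
}
```

In both variants a failed decrypt is swallowed, so a document whose empty
password does not work is simply handed on undecrypted.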
> encrypted pdf files aren't handled properly
> -------------------------------------------
>
> Key: TIKA-267
> URL: https://issues.apache.org/jira/browse/TIKA-267
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.4
> Environment: Ubuntu Linux 8.10, JRE 1.5
> Reporter: Sascha Szott
> Assignee: Jukka Zitting
> Priority: Critical
> Fix For: 0.5
>
> Original Estimate: 0.08h
> Remaining Estimate: 0.08h
>
> While I was working on extracting full text from a bunch of PDF documents,
> I noticed odd behaviour in Tika when processing encrypted documents
> (documents that restrict specific actions, e.g. editing or printing).
> To extract content from an encrypted PDF document you do not have to decrypt
> it in every case. For instance, when creating an (encrypted) PDF document the
> author can decide to allow content extraction without requiring a password.
> Unfortunately, Tika's PDF parser isn't aware of this at the moment. Therefore,
> I suggest a minor change inside the parse method of class
> org.apache.tika.parser.pdf.PDFParser: introduce an additional check ("is
> copying allowed") before trying to decrypt the document.
> To be more precise, here is a code snippet:
> public void parse(...) throws ... {
>     PDDocument pdfDocument = PDDocument.load(stream);
>     try {
>         // decrypt document only if copying is not allowed
>         if (!pdfDocument.getCurrentAccessPermission().canExtractContent()) {
>             if (pdfDocument.isEncrypted()) {
>                 try {
>                     pdfDocument.decrypt("");
>                 } catch (Exception e) {
>                     // Ignore
>                 }
>             }
>         }
>         ...
> Another solution to this problem would be to eliminate the "isEncrypted"
> check since PDFBox seems to handle the extraction of content out of encrypted
> documents correctly (and throws an IOException in case of failure).