[ 
https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623572#action_12623572
 ] 

Guillaume Smet commented on NUTCH-643:
--------------------------------------

Hi Andrzej,

This problem is also fixed in the non-Apache repository of PDFBox (on sf.net; 
the link I posted is from the sf.net CVS tree). However, I don't know whether 
ASF release policy allows you to build and ship an unreleased version of PDFBox.

Even if we can't fix it in the Nutch tree right now, the problem is at least 
documented here, and people can fix it themselves quite easily.

Regards,

-- 
Guillaume



> ClassCastException in PdfParser on encrypted PDF with empty password
> --------------------------------------------------------------------
>
>                 Key: NUTCH-643
>                 URL: https://issues.apache.org/jira/browse/NUTCH-643
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: This problem affects the current trunk too.
>            Reporter: Guillaume Smet
>         Attachments: parse-pdf-PDFBox_upgrade.diff
>
>
> Hi,
> If a PDF document is encrypted with an empty password, the PdfParser should 
> decrypt it using the empty password.
> This behaviour is implemented with the following code:
>       if (pdf.isEncrypted()) {
>         DocumentEncryption decryptor = new DocumentEncryption(pdf);
>         //Just try using the default password and move on
>         decryptor.decryptDocument("");
>       }
> It uses a deprecated API, and there also seems to be a bug in PDFBox in this 
> deprecated API (PDFBox throws a ClassCastException), as shown by the 
> following error:
> 2008-08-07 19:15:56,860 WARN  parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> 2008-08-07 19:15:56,862 WARN  parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
> 2008-08-07 19:15:56,874 WARN  fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> With the new security API, this document parses without error and we can 
> get its content:
>       if (pdf.isEncrypted()) {
>         // Just try using the default password and move on
>         pdf.openProtection(new StandardDecryptionMaterial(""));
>       }
> I have attached a patch that fixes this problem: it works perfectly with the 
> above document and gets rid of the deprecated API.
> Regards,
> -- 
> Guillaume
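
The exception quoted above is what Java raises when an object of a base type 
is downcast to a subtype it does not actually belong to. Below is a minimal 
sketch of that failure mode, assuming PDStandardEncryption is a subclass of 
PDEncryptionDictionary (the message suggests this, but the report does not 
state it). EncryptionDictionary, StandardEncryption and CastSketch are 
hypothetical stand-ins for illustration only, not PDFBox classes:

```java
// Hypothetical stand-ins for the two PDFBox types named in the exception.
// Assumption: the real PDStandardEncryption extends PDEncryptionDictionary.
class EncryptionDictionary {}

class StandardEncryption extends EncryptionDictionary {}

class CastSketch {
    // Performs an unconditional downcast and reports whether it threw.
    static boolean downcastFails(EncryptionDictionary dict) {
        try {
            // The runtime checks the object's real class at this point.
            StandardEncryption standard = (StandardEncryption) dict;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A plain base-class instance cannot be viewed as the subclass.
        System.out.println(downcastFails(new EncryptionDictionary()));
        // A genuine subclass instance casts without error.
        System.out.println(downcastFails(new StandardEncryption()));
    }
}
```

If the deprecated decryptDocument path performs such a cast internally, that 
would explain why it fails on this document while the newer openProtection 
call does not, but this is an inference from the stack trace rather than 
something confirmed in this report.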

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
