[ 
https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620955#action_12620955
 ] 

Guillaume Smet commented on NUTCH-643:
--------------------------------------

In fact, the problem is more complex than an API problem, and it is solved in 
the current PDFBox trunk (from the Apache Incubator). I used revision 683874.

I made the following changes:
- upgrade from FontBox-0.1-dev to FontBox-0.2-dev (shipped in PDFBox lib/ 
directory)
- upgrade from PDFBox-0.7.3 to PDFBox-0.7.4-dev (rev: 683874)
- copy bcprov-jdk14-132.jar, bcmail-jdk14-132.jar and their license to the 
parse-pdf lib/ directory: the license seems to be compatible with the Apache 
License (I took the jars from PDFBox trunk)
- fix the deprecation issues in PdfParser

I had a lot of errors indexing a bunch of PDF files from several websites. 
After this upgrade, things are far better: I no longer get any 
ClassCastException from PDFBox (these were fixed in the current trunk; see, for 
example, this patch from Feb 2007: 
http://pdfbox.cvs.sourceforge.net/pdfbox/pdfbox/src/org/pdfbox/filter/FlateFilter.java?r1=1.10&r2=1.11
 ).

Patch attached. It doesn't contain the jars, but it references them for 
completeness. I can add them if needed.
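
For readers unfamiliar with the failure mode: the ClassCastException reported 
in this issue is what Java throws when a downcast is applied to an object that 
is only of the more general type. Below is a minimal standalone sketch of that 
mechanism. It uses stand-in classes, not the real PDFBox types: the names 
BaseDictionary, StandardDictionary and CastDemo are hypothetical and chosen 
only for illustration, and the guarded variant is not a description of how 
PDFBox trunk fixed the bug.

```java
// Stand-in classes (hypothetical, NOT PDFBox types): a general dictionary
// and a more specific subclass, mirroring the shape of the failing cast.
class BaseDictionary {
    String filter() {
        return "generic";
    }
}

class StandardDictionary extends BaseDictionary {
    @Override
    String filter() {
        return "Standard";
    }
}

public class CastDemo {
    // Unchecked downcast: throws ClassCastException at runtime when the
    // argument is a plain BaseDictionary rather than a StandardDictionary.
    static String uncheckedFilter(BaseDictionary dict) {
        StandardDictionary std = (StandardDictionary) dict;
        return std.filter();
    }

    // Guarded variant: checks the runtime type before casting and falls
    // back to a marker value instead of throwing.
    static String guardedFilter(BaseDictionary dict) {
        if (dict instanceof StandardDictionary) {
            return ((StandardDictionary) dict).filter();
        }
        return "unsupported";
    }

    public static void main(String[] args) {
        BaseDictionary plain = new BaseDictionary();
        try {
            uncheckedFilter(plain);
        } catch (ClassCastException e) {
            System.out.println("unchecked cast failed");
        }
        System.out.println(guardedFilter(plain));
        System.out.println(guardedFilter(new StandardDictionary()));
    }
}
```

The point of the sketch is only that the cast compiles but fails at runtime, 
which is why the error surfaced while fetching and not at build time.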

> ClassCastException in PdfParser on encrypted PDF with empty password
> --------------------------------------------------------------------
>
>                 Key: NUTCH-643
>                 URL: https://issues.apache.org/jira/browse/NUTCH-643
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: This problem affects the current trunk too.
>            Reporter: Guillaume Smet
>
> Hi,
> If a PDF document is encrypted with an empty password, the PdfParser should 
> decrypt it using the empty password.
> This behaviour is implemented with the following code:
>       if (pdf.isEncrypted()) {
>         DocumentEncryption decryptor = new DocumentEncryption(pdf);
>         //Just try using the default password and move on
>         decryptor.decryptDocument("");
>       }
> It uses a deprecated API and, moreover, there seems to be a bug in PDFBox in 
> this deprecated API: it throws a ClassCastException, as shown by the 
> following error:
> 2008-08-07 19:15:56,860 WARN  parse.pdf - General exception in PDF parser: 
> org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to 
> org.pdfbox.pdmodel.encryption.PDStandardEncryption
> 2008-08-07 19:15:56,862 WARN  parse.pdf - java.lang.ClassCastException: 
> org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to 
> org.pdfbox.pdmodel.encryption.PDStandardEncryption
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at 
> org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at 
> org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at 
> org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
> 2008-08-07 19:15:56,874 WARN  fetcher.Fetcher - Error parsing: 
> http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be 
> handled as pdf document. java.lang.ClassCastException: 
> org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to 
> org.pdfbox.pdmodel.encryption.PDStandardEncryption
> Using the new security API, we don't have any error parsing this document and 
> we can get its content:
>       if (pdf.isEncrypted()) {
>         // Just try using the default password and move on
>         pdf.openProtection(new StandardDecryptionMaterial(""));
>       }
> I attached the patch fixing this problem: it works perfectly with the above 
> document and gets rid of the deprecated API.
> Regards,
> -- 
> Guillaume

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
