[ 
https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668010#action_12668010
 ] 

Guillaume Smet commented on NUTCH-643:
--------------------------------------

Hi Doğacan,

The problem isn't the license of PDFBox, which is already included in Nutch. 
It's more that PDFBox is on its way to becoming an Apache project (it's in the 
incubator - see http://incubator.apache.org/pdfbox/) and it seems that you 
can't include a library which is in the incubator.

So you can either wait for PDFBox to become a full Apache project or build a 
development version from the latest PDFBox tree on sourceforge.net, which is 
what I did (the problem is fixed in the sf.net tree). But you then have a 
development version in the Nutch tree rather than a stable release: I'm not 
sure that's acceptable.

It's more a problem of release policy and release rules than a technical or 
license problem.

-- 
Guillaume

> ClassCastException in PdfParser on encrypted PDF with empty password
> --------------------------------------------------------------------
>
>                 Key: NUTCH-643
>                 URL: https://issues.apache.org/jira/browse/NUTCH-643
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: This problem affects the current trunk too.
>            Reporter: Guillaume Smet
>         Attachments: parse-pdf-PDFBox_upgrade.diff
>
>
> Hi,
> If a PDF document is encrypted with an empty password, the PdfParser should 
> decrypt it using the empty password.
> This behaviour is implemented with the following code:
>       if (pdf.isEncrypted()) {
>         DocumentEncryption decryptor = new DocumentEncryption(pdf);
>         //Just try using the default password and move on
>         decryptor.decryptDocument("");
>       }
> It uses a deprecated API and, moreover, there seems to be a bug in PDFBox in 
> this deprecated API: it throws a ClassCastException, as shown by the 
> following error:
> 2008-08-07 19:15:56,860 WARN  parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> 2008-08-07 19:15:56,862 WARN  parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
> 2008-08-07 19:15:56,874 WARN  fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> Using the new security API, we don't have any error parsing this document and 
> we can get its content:
>       if (pdf.isEncrypted()) {
>         // Just try using the default password and move on
>         pdf.openProtection(new StandardDecryptionMaterial(""));
>       }
> I attached the patch fixing this problem: it works perfectly with the above 
> document and gets rid of the deprecated API.
> Regards,
> -- 
> Guillaume
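
The ClassCastException in the quoted report looks like a failed downcast: the 
deprecated code path seems to assume the encryption dictionary is always the 
more specific type. A minimal sketch of that failure mode follows. The class 
names are hypothetical stand-ins, not the real PDFBox types, and the 
parent/child relationship between them is an assumption made for illustration.

```java
// Stand-in classes, NOT the real PDFBox API: hypothetical names used only to
// illustrate how an unchecked downcast produces a ClassCastException.
public class CastDemo {
    // Plays the role of the general encryption dictionary (assumed base type).
    static class EncryptionDictionary {}

    // Plays the role of the more specific standard-encryption type
    // (assumed to be a subclass of the dictionary).
    static class StandardEncryption extends EncryptionDictionary {}

    // Mimics a code path that assumes every dictionary is the subclass.
    static String decryptWithDowncast(EncryptionDictionary dict) {
        try {
            // Fails at runtime when dict is only the base type.
            StandardEncryption std = (StandardEncryption) dict;
            return "decrypted";
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    public static void main(String[] args) {
        // The subclass instance survives the cast.
        System.out.println(decryptWithDowncast(new StandardEncryption()));
        // The base-type instance does not.
        System.out.println(decryptWithDowncast(new EncryptionDictionary()));
    }
}
```

The cast only succeeds when the object really is an instance of the subclass, 
which would explain why a code path that does not rely on such a cast can open 
the same document without error.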

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
