[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668010#action_12668010 ]
Guillaume Smet commented on NUTCH-643:
--------------------------------------

Hi Doğacan,

The problem isn't the license of PDFBox, which is already included in Nutch. It's more that PDFBox is on its way to becoming an Apache project (it's in the incubator - see http://incubator.apache.org/pdfbox/) and it seems that you can't include a library which is in the incubator.

So you can either wait for PDFBox to become a full Apache project or build a development version of the latest PDFBox tree on sourceforge.net, which is what I did (the problem is fixed in the sf.net tree). But you then have a development version in the Nutch tree and not a stable release: I'm not sure that's acceptable.

It's more a problem of release policy and release rules than a technical or license problem.

-- 
Guillaume

> ClassCastException in PdfParser on encrypted PDF with empty password
> --------------------------------------------------------------------
>
>                 Key: NUTCH-643
>                 URL: https://issues.apache.org/jira/browse/NUTCH-643
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: This problem affects the current trunk too.
>            Reporter: Guillaume Smet
>         Attachments: parse-pdf-PDFBox_upgrade.diff
>
>
> Hi,
> If a PDF document is encrypted with an empty password, the PdfParser should
> decrypt it using the empty password.
> This behaviour is implemented with the following code:
>     if (pdf.isEncrypted()) {
>         DocumentEncryption decryptor = new DocumentEncryption(pdf);
>         // Just try using the default password and move on
>         decryptor.decryptDocument("");
>     }
> It uses a deprecated API and moreover it seems there is a bug in PDFBox in
> this deprecated API (we have a ClassCastException in PDFBox) as we have the
> following error:
> 2008-08-07 19:15:56,860 WARN parse.pdf - General exception in PDF parser:
> org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to
> org.pdfbox.pdmodel.encryption.PDStandardEncryption
> 2008-08-07 19:15:56,862 WARN parse.pdf - java.lang.ClassCastException:
> org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to
> org.pdfbox.pdmodel.encryption.PDStandardEncryption
> 2008-08-07 19:15:56,862 WARN parse.pdf - at
> org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
> 2008-08-07 19:15:56,862 WARN parse.pdf - at
> org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
> 2008-08-07 19:15:56,862 WARN parse.pdf - at
> org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
> 2008-08-07 19:15:56,862 WARN parse.pdf - at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
> 2008-08-07 19:15:56,862 WARN parse.pdf - at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
> 2008-08-07 19:15:56,874 WARN fetcher.Fetcher - Error parsing:
> http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be
> handled as pdf document.
> java.lang.ClassCastException:
> org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to
> org.pdfbox.pdmodel.encryption.PDStandardEncryption
> Using the new security API, we don't have any error parsing this document and
> we can get its content:
>     if (pdf.isEncrypted()) {
>         // Just try using the default password and move on
>         pdf.openProtection(new StandardDecryptionMaterial(""));
>     }
> I attached the patch fixing this problem: it works perfectly with the above
> document and gets rid of the deprecated API.
> Regards,
> --
> Guillaume

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
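
As an aside for readers of the stack trace: the "cannot be cast to" message is what Java reports when an object is downcast to a type it does not actually have at runtime. The sketch below reproduces that failure mode in isolation. It assumes the two types in the log are related as superclass and subclass; `EncryptionDictionary`, `StandardEncryption`, `CastDemo` and the method names are hypothetical stand-ins, not the real PDFBox classes or API.

```java
// Hypothetical stand-ins for the two types named in the stack trace.
// These are NOT the real PDFBox classes; they only model the type relation.
class EncryptionDictionary {
    String filter() { return "Unknown"; }
}

class StandardEncryption extends EncryptionDictionary {
    @Override
    String filter() { return "Standard"; }
}

public class CastDemo {
    // Mirrors the failing pattern: an unconditional downcast.
    static String uncheckedFilter(EncryptionDictionary dict) {
        // Throws ClassCastException if dict is not a StandardEncryption.
        StandardEncryption std = (StandardEncryption) dict;
        return std.filter();
    }

    // Defensive variant: check the runtime type before casting.
    static String checkedFilter(EncryptionDictionary dict) {
        if (dict instanceof StandardEncryption) {
            return ((StandardEncryption) dict).filter();
        }
        return "unsupported: " + dict.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        EncryptionDictionary plain = new EncryptionDictionary();
        System.out.println(checkedFilter(plain));
        try {
            uncheckedFilter(plain);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException on the unchecked downcast");
        }
    }
}
```

This only illustrates why the deprecated code path can fail; the fix described in the issue is to stop using that path and call `pdf.openProtection(new StandardDecryptionMaterial(""))` instead.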