[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620955#action_12620955 ]
Guillaume Smet commented on NUTCH-643:
--------------------------------------

In fact, the problem is more complex than an API problem and is solved in current PDFBox trunk (from Apache incubator). I used the revision 683874.

I made the following changes:
- upgrade from FontBox-0.1-dev to FontBox-0.2-dev (shipped in PDFBox lib/ directory)
- upgrade from PDFBox-0.7.3 to PDFBox-0.7.4-dev (rev: 683874)
- copy bcprov-jdk14-132.jar, bcmail-jdk14-132.jar and their licence to parse-pdf lib/ directory: the license seems to be compatible with Apache license (I took the jars from PDFBox trunk)
- fix the deprecation issues in PdfParser

I had a lot of errors indexing a bunch of PDF files from several websites. After this upgrade, it's far far better: I don't have any ClassCastException issues in PDFBox anymore (they fixed them in the current trunk, for example see this patch from Feb 2007: http://pdfbox.cvs.sourceforge.net/pdfbox/pdfbox/src/org/pdfbox/filter/FlateFilter.java?r1=1.10&r2=1.11 ).

Patch attached. The patch doesn't contain the jars but they are referenced in the patch for completeness. I can add them if needed.

> ClassCastException in PdfParser on encrypted PDF with empty password
> --------------------------------------------------------------------
>
>                 Key: NUTCH-643
>                 URL: https://issues.apache.org/jira/browse/NUTCH-643
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: This problem affects the current trunk too.
>            Reporter: Guillaume Smet
>
> Hi,
> If a PDF document is encrypted with an empty password, the PdfParser should
> decrypt it using the empty password.
> This behaviour is implemented with the following code:
>     if (pdf.isEncrypted()) {
>         DocumentEncryption decryptor = new DocumentEncryption(pdf);
>         // Just try using the default password and move on
>         decryptor.decryptDocument("");
>     }
> It uses a deprecated API and moreover it seems there is a bug in PDFBox in
> this deprecated API (we have a ClassCastException in PDFBox) as we have the
> following error:
> 2008-08-07 19:15:56,860 WARN parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> 2008-08-07 19:15:56,862 WARN parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> 2008-08-07 19:15:56,862 WARN parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
> 2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
> 2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
> 2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
> 2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
> 2008-08-07 19:15:56,874 WARN fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document.
> java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> Using the new security API, we don't have any error parsing this document and
> we can get its content:
>     if (pdf.isEncrypted()) {
>         // Just try using the default password and move on
>         pdf.openProtection(new StandardDecryptionMaterial(""));
>     }
> I attached the patch fixing this problem: it works perfectly with the above
> document and gets rid of the deprecated API.
> Regards,
> --
> Guillaume

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
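
For readers unfamiliar with the error class reported above, here is a minimal, self-contained sketch of how an unconditional downcast produces a ClassCastException. The classes EncryptionDictionary and StandardEncryption are hypothetical stand-ins, not the real PDFBox classes, and the parent/child relationship between them is an assumption inferred from the error message; the method names are likewise invented for illustration.

```java
// Hypothetical stand-ins; these are NOT the real PDFBox classes.
class EncryptionDictionary {
    String filter() { return "Standard"; }
}

// Assumed subtype, mirroring the "cannot be cast to" direction in the log.
class StandardEncryption extends EncryptionDictionary {
    int revision() { return 2; }
}

public class DowncastDemo {
    // Mirrors the failing pattern: cast without checking the runtime type.
    static String unsafeDowncast(EncryptionDictionary dict) {
        try {
            StandardEncryption std = (StandardEncryption) dict;
            return "ok, revision " + std.revision();
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    // Defensive variant: check the runtime type before casting.
    static String checkedDowncast(EncryptionDictionary dict) {
        if (dict instanceof StandardEncryption) {
            return "ok, revision " + ((StandardEncryption) dict).revision();
        }
        return "not a StandardEncryption";
    }

    public static void main(String[] args) {
        // Runtime type is the base class, so the cast fails.
        System.out.println(unsafeDowncast(new EncryptionDictionary()));
        // Runtime type is the subclass, so the cast succeeds.
        System.out.println(unsafeDowncast(new StandardEncryption()));
        // The checked variant never throws.
        System.out.println(checkedDowncast(new EncryptionDictionary()));
    }
}
```

The sketch only shows the general mechanism. Whether the deprecated PDFBox code fails for exactly this reason is not confirmed by the report, which states only that the cast fails inside DocumentEncryption.decryptDocument.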