[jira] [Comment Edited] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField
[ https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880877#comment-13880877 ]

Eric Knauel edited comment on TIKA-1226 at 1/24/14 10:34 AM:
-------------------------------------------------------------

Tim, I've uploaded a test PDF document that contains a field of type signature. That's not the original document that led to the error, but it produces the same stack trace. The form data is empty and the document is not signed. Somehow I couldn't get the PDF editor to set the permission for signing to true. (This permission was enabled in my original document, but I can't publish the document.)

I think it would be useful to grab information on the signer from the PDF!

was (Author: uknela):
PDF test file with a form that contains a field of type signature.

PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

Key: TIKA-1226
URL: https://issues.apache.org/jira/browse/TIKA-1226
Project: Tika
Issue Type: Bug
Affects Versions: 1.5
Reporter: Eric Knauel
Assignee: Tim Allison
Attachments: pdf-form-with-signature-field-empty.pdf

I have a PDF document that contains a filled-in form. Among the various fields of type text and radio button there are multiple fields for digital signatures. When I load this document into tika-app I get the following exception:
{noformat}
Caused by: java.lang.RuntimeException: Can't get signature as String, use getSignature() instead.
	at org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField.getValue(PDSignatureField.java:131)
	at org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:467)
	at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:425)
	at org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:411)
	at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:184)
	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:330)
	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:95)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	... 43 more
{noformat}
The problem seems to be that PDF2XHTML assumes it can call getValue() on every PDField object. According to the PDFBox 1.8.3 Javadoc, this does not hold for the subclass PDSignatureField:
http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/pdmodel/interactive/form/PDSignatureField.html
The Javadoc says that getSignature() should be called instead.
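To make the failure mode concrete, here is a minimal sketch that reproduces it without PDFBox on the classpath. SketchField and SketchSignatureField are hypothetical stand-ins for PDField and PDSignatureField; they only mimic the behaviour described above (getValue() throwing for signature fields) and are not the real PDFBox classes. The stand-in getSignature() returns a plain signer name, which is a simplification.

```java
// Hypothetical stand-in for PDFBox's PDField; not the real class.
class SketchField {
    private final String name;
    private final String value;

    SketchField(String name, String value) {
        this.name = name;
        this.value = value;
    }

    String getPartialName() { return name; }

    String getValue() { return value; }
}

// Hypothetical stand-in for PDSignatureField; not the real class.
class SketchSignatureField extends SketchField {
    private final String signerName;

    SketchSignatureField(String name, String signerName) {
        super(name, null);
        this.signerName = signerName;
    }

    // Mirrors the reported behaviour: the String accessor is unsupported.
    @Override
    String getValue() {
        throw new RuntimeException("Can't get signature as String, use getSignature() instead.");
    }

    // Simplified stand-in for getSignature(); returns only the signer name.
    String getSignature() { return signerName; }
}

class SignatureFieldSketch {
    // Type-aware extraction: getValue() is never called on a signature field.
    static String describe(SketchField field) {
        if (field instanceof SketchSignatureField) {
            String signer = ((SketchSignatureField) field).getSignature();
            return field.getPartialName() + ": " + (signer == null ? "" : signer);
        }
        String value = field.getValue();
        return field.getPartialName() + ": " + (value == null ? "" : value);
    }
}
```

Dispatching on the field type before touching getValue() keeps ordinary text fields on the existing path and gives signature fields their own branch, which can either be left empty or filled from the signature data.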
Assuming that the information inside the signature is not relevant for the extraction process and can be discarded, the following patch helps:
{noformat}
Index: tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
+UTF-8
===
--- tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java (revision 1560617)
+++ tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java (revision )
@@ -40,6 +40,7 @@
 import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode;
 import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
 import org.apache.pdfbox.pdmodel.interactive.form.PDField;
+import org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField;
 import org.apache.pdfbox.util.PDFTextStripper;
 import org.apache.pdfbox.util.TextPosition;
 import org.apache.tika.exception.TikaException;
@@ -464,7 +465,9 @@
         }
         String value = "";
         try {
+            if (!(field instanceof PDSignatureField)) {
-            value = field.getValue();
+                value = field.getValue();
+            }
         } catch (IOException e) {
             //swallow
         }
{noformat}

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField
[ https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881383#comment-13881383 ]

Tim Allison edited comment on TIKA-1226 at 1/24/14 8:22 PM:
------------------------------------------------------------

Thank you for the test file. I'll use that in the formal test. I used another doc for dev that I unfortunately can't share.

Does this format look good? My dev doc only had name and date, but the other info would also show up if it existed...
{noformat}
<div class="acroform">
  <ol>
    <li altName="name">Name: my name</li>
    <li>
      <ol type="signaturedata">
        <li signdata="date">2014-01-17T11:57:26-0500</li>
        <li signdata="name">my name</li>
      </ol>
    </li>
  </ol>
</div>
{noformat}
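For illustration, the nested signature list proposed in the comment above could be produced along these lines. This is a rough sketch using only the JDK, not the code that went into Tika: the class and method names (SignatureMarkupSketch, formatSignDate, render) are hypothetical, the element and attribute names are taken from the sample, and the timestamp pattern is inferred from the single date shown there.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;
import java.util.Map;

class SignatureMarkupSketch {
    // Formats a signing date like the sample value 2014-01-17T11:57:26-0500.
    // The pattern is an assumption inferred from that one sample.
    static String formatSignDate(Calendar signDate) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ", Locale.ROOT);
        fmt.setTimeZone(signDate.getTimeZone()); // keep the signer's UTC offset
        return fmt.format(signDate.getTime());
    }

    // Minimal escaping for text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Renders one <li> holding a nested <ol type="signaturedata">.
    // Entries with a missing or empty value are skipped, so only the
    // information that exists in the signature shows up.
    static String render(Map<String, String> signData) {
        StringBuilder sb = new StringBuilder("<li><ol type=\"signaturedata\">");
        for (Map.Entry<String, String> e : signData.entrySet()) {
            String value = e.getValue();
            if (value == null || value.isEmpty()) {
                continue;
            }
            sb.append("<li signdata=\"").append(escape(e.getKey())).append("\">")
              .append(escape(value)).append("</li>");
        }
        return sb.append("</ol></li>").toString();
    }
}
```

Passing an insertion-ordered map (for example a LinkedHashMap) keeps the entries in a stable order, and skipping empty values matches the remark that additional fields would only appear when the document actually carries them.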