[ https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880130#comment-13880130 ]
Tim Allison commented on TIKA-1226:
-----------------------------------
Eric,
Thank you for reporting this. I'll make the fix shortly. Are you able to
share your document as a test case? Thank you, again.
> PDFTextStripper fails while getting data of PDF form fields of type
> PDSignatureField
> ------------------------------------------------------------------------------------
>
> Key: TIKA-1226
> URL: https://issues.apache.org/jira/browse/TIKA-1226
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.5
> Reporter: Eric Knauel
> Assignee: Tim Allison
>
> I have a PDF document that contains a filled-in form. Alongside the various
> text and radio-button fields there are multiple digital-signature fields.
> When I load this document into tika-app I get the following exception:
> {noformat}
> Caused by: java.lang.RuntimeException: Can't get signature as String, use getSignature() instead.
> at org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField.getValue(PDSignatureField.java:131)
> at org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:467)
> at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:425)
> at org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:411)
> at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:184)
> at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:330)
> at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:95)
> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> ... 43 more
> {noformat}
> The problem seems to be that PDF2XHTML expects to be able to call
> getValue() on every PDField object. According to the PDFBox 1.8.3 Javadoc,
> this does not hold for the subclass PDSignatureField:
> http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/pdmodel/interactive/form/PDSignatureField.html
> The Javadoc says that getSignature() should be called instead.
> Assuming that the information inside the signature is not relevant to the
> extraction process and can be discarded, the following patch helps:
> {noformat}
> Index: tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
> IDEA additional info:
> Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
> <+>UTF-8
> ===================================================================
> --- tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java (revision 1560617)
> +++ tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java (revision )
> @@ -40,6 +40,7 @@
> import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode;
> import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
> import org.apache.pdfbox.pdmodel.interactive.form.PDField;
> +import org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField;
> import org.apache.pdfbox.util.PDFTextStripper;
> import org.apache.pdfbox.util.TextPosition;
> import org.apache.tika.exception.TikaException;
> @@ -464,7 +465,9 @@
> }
> String value = "";
> try {
> + if (!(field instanceof PDSignatureField)) {
> - value = field.getValue();
> + value = field.getValue();
> + }
> } catch (IOException e) {
> //swallow
> }
> {noformat}
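
For readers without PDFBox on the classpath, the guard in the patch above can be tried in isolation. The following is only a minimal sketch, not the committed fix: Field, TextField, SignatureField and extractValue are hypothetical stand-ins for PDField, an ordinary form field, PDSignatureField and the body of addFieldString. The stand-in signature field throws the same RuntimeException that appears in the stack trace.

```java
import java.io.IOException;

// Hypothetical stand-ins for the PDFBox classes, so the sketch runs on its own.
class SignatureGuardSketch {

    /** Stand-in for PDField: most field types can report a String value. */
    static abstract class Field {
        abstract String getValue() throws IOException;
    }

    /** Stand-in for an ordinary text field. */
    static class TextField extends Field {
        private final String value;
        TextField(String value) { this.value = value; }
        @Override String getValue() { return value; }
    }

    /** Stand-in for PDSignatureField: getValue() fails as in the stack trace. */
    static class SignatureField extends Field {
        @Override String getValue() {
            throw new RuntimeException(
                "Can't get signature as String, use getSignature() instead.");
        }
    }

    /** Mirrors the patched logic: skip getValue() for signature fields. */
    static String extractValue(Field field) {
        String value = "";
        try {
            if (!(field instanceof SignatureField)) {
                value = field.getValue();
            }
        } catch (IOException e) {
            // swallow, as the code around the patch does
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println("[" + extractValue(new TextField("Eric")) + "]");  // prints [Eric]
        System.out.println("[" + extractValue(new SignatureField()) + "]");   // prints []
    }
}
```

Without the instanceof check, the second call would propagate the RuntimeException, because the catch clause only handles IOException. With the check, the signature field yields an empty string, which matches the patch's assumption that the signature contents can be discarded during extraction.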