[jira] [Comment Edited] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

2014-01-24 Thread Eric Knauel (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880877#comment-13880877 ]

Eric Knauel edited comment on TIKA-1226 at 1/24/14 10:34 AM:
-

Tim,

I've uploaded a test PDF document that contains a field of type signature. 
It is not the original document that led to the error, but it produces the same 
stack trace. The form data is empty and the document is not signed. I couldn't 
get the PDF editor to set the signing permission to true. (That permission was 
enabled in my original document, but I can't publish that document.)

I think it would be useful to grab information on the signer from the PDF!


was (Author: uknela):
PDF test file with a form that contains a field of type signature.

 PDFTextStripper fails while getting data of PDF form fields of type 
 PDSignatureField
 

 Key: TIKA-1226
 URL: https://issues.apache.org/jira/browse/TIKA-1226
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Eric Knauel
Assignee: Tim Allison
 Attachments: pdf-form-with-signature-field-empty.pdf


 I have a PDF document that contains a filled-in form. Among the various 
 text and radio-button fields there are multiple fields for digital 
 signatures. When I load this document into tika-app I get the following 
 exception:
 {noformat}
 Caused by: java.lang.RuntimeException: Can't get signature as String, use getSignature() instead.
    at org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField.getValue(PDSignatureField.java:131)
    at org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:467)
    at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:425)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:411)
    at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:184)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:330)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:95)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 43 more
 {noformat}
 The problem seems to be that PDF2XHTML expects to be able to call 
 getValue() on all PDField objects. According to the PDFBox 1.8.3 Javadoc, 
 this is not true for the subclass PDSignatureField:
 http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/pdmodel/interactive/form/PDSignatureField.html
 The Javadoc says that getSignature() should be called instead. 
 Assuming that the information inside the signature is not relevant for the 
 extraction process and can be discarded, the following patch helps:
 {noformat}
 Index: tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
 IDEA additional info:
 Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
 <+>UTF-8
 ===
 --- tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java  (revision 1560617)
 +++ tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java  (revision )
 @@ -40,6 +40,7 @@
  import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode;
  import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
  import org.apache.pdfbox.pdmodel.interactive.form.PDField;
 +import org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField;
  import org.apache.pdfbox.util.PDFTextStripper;
  import org.apache.pdfbox.util.TextPosition;
  import org.apache.tika.exception.TikaException;
 @@ -464,7 +465,9 @@
          }
          String value = "";
          try {
 +            if (!(field instanceof PDSignatureField)) {
 -            value = field.getValue();
 +                value = field.getValue();
 +            }
          } catch (IOException e) {
              //swallow
          }
 {noformat}
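
For readers without a PDFBox checkout, the guard in the patch can be tried in isolation. This is only a sketch: `Field` and `SignatureField` are hypothetical stand-ins for PDFBox's PDField and PDSignatureField, not the real classes, and `safeValue` is an illustrative name rather than the method in PDF2XHTML.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SignatureFieldGuard {

    // Hypothetical stand-in for PDFBox's PDField; not the real class.
    static class Field {
        private final String value;
        Field(String value) { this.value = value; }
        String getValue() throws IOException { return value; }
    }

    // Hypothetical stand-in for PDSignatureField: getValue() throws,
    // mirroring the RuntimeException seen in the stack trace.
    static class SignatureField extends Field {
        SignatureField() { super(null); }
        @Override
        String getValue() {
            throw new RuntimeException(
                "Can't get signature as String, use getSignature() instead.");
        }
    }

    // Same guard as the patch: never call getValue() on a signature field.
    static String safeValue(Field field) {
        String value = "";
        try {
            if (!(field instanceof SignatureField)) {
                value = field.getValue();
            }
        } catch (IOException e) {
            // swallow, as the patched code does
        }
        // Extra null check, added for this sketch only.
        return value == null ? "" : value;
    }

    public static void main(String[] args) {
        List<Field> fields = new ArrayList<>();
        fields.add(new Field("my name"));
        fields.add(new SignatureField());
        for (Field f : fields) {
            System.out.println("[" + safeValue(f) + "]");
        }
    }
}
```

With the guard in place the signature field simply yields an empty string, which matches the assumption that the signature contents can be discarded.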



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Comment Edited] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

2014-01-24 Thread Tim Allison (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881383#comment-13881383 ]

Tim Allison edited comment on TIKA-1226 at 1/24/14 8:22 PM:


Thank you for the test file.  I'll use that in the formal test.  I used another 
doc for dev that I unfortunately can't share.  Does this format look good?  My 
dev doc only had name and date, but the other info would also show up if it 
existed...

{noformat}
<div class="acroform">
<ol>
<li altName="name">Name: my name</li>
<li>
<ol type="signaturedata">
<li signdata="date">2014-01-17T11:57:26-0500</li>
<li signdata="name">my name</li>
</ol>
</li>
</ol>
</div>
{noformat}
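
To make the proposed shape concrete, here is a rough sketch of how the nested signature list could be produced from a map of signature properties. It is not the Tika implementation: the class `SignatureDataXhtml`, the `render` and `escape` helpers, and the use of a plain map are all assumptions for illustration. Entries with no value are skipped, which mirrors "the other info would also show up if it existed".

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SignatureDataXhtml {

    // Escape characters that are unsafe in XHTML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Render signature properties as a nested <ol type="signaturedata"> item.
    // Null or empty values are skipped.
    static String render(Map<String, String> signData) {
        StringBuilder sb = new StringBuilder();
        sb.append("<li>\n<ol type=\"signaturedata\">\n");
        for (Map.Entry<String, String> e : signData.entrySet()) {
            String v = e.getValue();
            if (v == null || v.isEmpty()) {
                continue;
            }
            sb.append("<li signdata=\"").append(escape(e.getKey())).append("\">")
              .append(escape(v)).append("</li>\n");
        }
        sb.append("</ol>\n</li>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so date is emitted before name.
        Map<String, String> data = new LinkedHashMap<>();
        data.put("date", "2014-01-17T11:57:26-0500");
        data.put("name", "my name");
        data.put("location", null); // absent in the dev doc, so not emitted
        System.out.print(render(data));
    }
}
```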

