[jira] [Commented] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

2014-01-27 Thread Eric Knauel (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882661#comment-13882661
 ] 

Eric Knauel commented on TIKA-1226:
---

Tim, that looks very good to me!

> PDFTextStripper fails while getting data of PDF form fields of type 
> PDSignatureField
> 
>
> Key: TIKA-1226
> URL: https://issues.apache.org/jira/browse/TIKA-1226
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.5
>Reporter: Eric Knauel
>Assignee: Tim Allison
> Attachments: pdf-form-with-signature-field-empty.pdf
>
>
> I have a PDF document that contains a filled-in form. Among the various 
> fields of type text and radio button there are multiple fields for digital 
> signatures. When I load this document into tika-app, I get the following 
> exception:
> {noformat}
> Caused by: java.lang.RuntimeException: Can't get signature as String, use getSignature() instead.
>   at org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField.getValue(PDSignatureField.java:131)
>   at org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:467)
>   at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:425)
>   at org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:411)
>   at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:184)
>   at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:330)
>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:95)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   ... 43 more
> {noformat}
> The problem seems to be that PDF2XHTML expects that it can call 
> getValue() on all PDField objects. According to the PDFBox 1.8.3 Javadoc, 
> this is not true for the subclass PDSignatureField:
> http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/pdmodel/interactive/form/PDSignatureField.html
> The Javadoc says that getSignature() should be called instead.
> Assuming that the information inside the signature is not relevant for the 
> extraction process and can be discarded, the following patch helps:
> {noformat}
> Index: tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
> IDEA additional info:
> Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
> <+>UTF-8
> ===
> --- tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java  (revision 1560617)
> +++ tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java  (revision )
> @@ -40,6 +40,7 @@
>  import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode;
>  import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
>  import org.apache.pdfbox.pdmodel.interactive.form.PDField;
> +import org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField;
>  import org.apache.pdfbox.util.PDFTextStripper;
>  import org.apache.pdfbox.util.TextPosition;
>  import org.apache.tika.exception.TikaException;
> @@ -464,7 +465,9 @@
>          }
>          String value = "";
>          try {
> +            if (!(field instanceof PDSignatureField)) {
> -            value = field.getValue();
> +                value = field.getValue();
> +            }
>          } catch (IOException e) {
>              //swallow
>          }
> {noformat}
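
The effect of the guard in the patch above can be tried without PDFBox on the classpath. The following is only a minimal sketch: `Field`, `TextField`, `SignatureField` and `SignatureGuardDemo` are hypothetical stand-ins invented for illustration, not the PDFBox or Tika classes, and the only behaviour copied from the report is that the signature field's getValue() throws the RuntimeException shown in the stack trace.

```java
import java.io.IOException;

// Hypothetical stand-in for PDField: getValue() may throw IOException.
abstract class Field {
    abstract String getValue() throws IOException;
}

// Hypothetical stand-in for an ordinary text field.
class TextField extends Field {
    private final String value;

    TextField(String value) {
        this.value = value;
    }

    @Override
    String getValue() {
        return value;
    }
}

// Hypothetical stand-in for PDSignatureField: getValue() refuses to
// return a String, mirroring the exception in the reported stack trace.
class SignatureField extends Field {
    @Override
    String getValue() {
        throw new RuntimeException(
                "Can't get signature as String, use getSignature() instead.");
    }
}

class SignatureGuardDemo {
    // Same shape as the patched code: skip getValue() for signature
    // fields and fall back to the empty string.
    static String fieldValue(Field field) {
        String value = "";
        try {
            if (!(field instanceof SignatureField)) {
                value = field.getValue();
            }
        } catch (IOException e) {
            // swallow, as the surrounding code in the patch does
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println("[" + fieldValue(new TextField("Jane")) + "]");
        System.out.println("[" + fieldValue(new SignatureField()) + "]");
    }
}
```

Without the instanceof check, the second call would propagate the RuntimeException, because the catch clause only handles IOException.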



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Comment Edited] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

2014-01-24 Thread Eric Knauel (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880877#comment-13880877
 ] 

Eric Knauel edited comment on TIKA-1226 at 1/24/14 10:34 AM:
-

Tim,

I've uploaded a test PDF document that contains a field of type signature. 
It is not the original document that led to the error, but it produces the same 
stack trace. The form data is empty and the document is not signed. Somehow I 
couldn't get the PDF editor to set the permission for signing to true. (This 
permission was enabled in my original document, but I can't publish that 
document.)

I think it would be useful to grab information on the signer from the PDF!
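
A rough sketch of what grabbing the signer could look like, written against hypothetical stand-ins (`Signature`, `SigField`, `SignerInfoDemo`) rather than the PDFBox classes so that it runs on its own. It assumes that the signature object exposes a signer name and that an unsigned field, like the one in the uploaded test file, yields no signature object at all; both assumptions would need checking against the PDFBox 1.8 Javadoc for PDSignatureField.getSignature() and PDSignature.

```java
// Hypothetical stand-in for the signature object; only a name is modelled.
class Signature {
    private final String name;

    Signature(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

// Hypothetical stand-in for a signature form field.
class SigField {
    private final Signature signature; // null when the field is unsigned

    SigField(Signature signature) {
        this.signature = signature;
    }

    Signature getSignature() {
        return signature;
    }
}

class SignerInfoDemo {
    // Returns the signer name, or "" when the field is unsigned or the
    // signature carries no name.
    static String signerName(SigField field) {
        Signature sig = field.getSignature();
        if (sig == null || sig.getName() == null) {
            return "";
        }
        return sig.getName();
    }

    public static void main(String[] args) {
        System.out.println("[" + signerName(new SigField(new Signature("Jane Doe"))) + "]");
        System.out.println("[" + signerName(new SigField(null)) + "]");
    }
}
```

The null checks matter for the attached test document, whose signature field is empty.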


was (Author: uknela):
PDF test file with a form that contains a field of type signature.






[jira] [Updated] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

2014-01-24 Thread Eric Knauel (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Knauel updated TIKA-1226:
--

Attachment: pdf-form-with-signature-field-empty.pdf

PDF test file with a form that contains a field of type signature.






[jira] [Updated] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

2014-01-23 Thread Eric Knauel (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Knauel updated TIKA-1226:
--

Description: 
I have a PDF document that contains a filled-in form. Among the various fields 
of type text and radio button there are multiple fields for digital signatures. 
When I load this document into tika-app, I get the following exception:

{noformat}
Caused by: java.lang.RuntimeException: Can't get signature as String, use getSignature() instead.
at org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField.getValue(PDSignatureField.java:131)
at org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:467)
at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:425)
at org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:411)
at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:184)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:330)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:95)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
... 43 more
{noformat}

The problem seems to be that PDF2XHTML expects that it can call 
getValue() on all PDField objects. According to the PDFBox 1.8.3 Javadoc, this 
is not true for the subclass PDSignatureField:

http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/pdmodel/interactive/form/PDSignatureField.html

The Javadoc says that getSignature() should be called instead.

Assuming that the information inside the signature is not relevant for the 
extraction process and can be discarded, the following patch helps:

{noformat}
Index: tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===
--- tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java  (revision 1560617)
+++ tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java  (revision )
@@ -40,6 +40,7 @@
 import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode;
 import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
 import org.apache.pdfbox.pdmodel.interactive.form.PDField;
+import org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField;
 import org.apache.pdfbox.util.PDFTextStripper;
 import org.apache.pdfbox.util.TextPosition;
 import org.apache.tika.exception.TikaException;
@@ -464,7 +465,9 @@
         }
         String value = "";
         try {
+            if (!(field instanceof PDSignatureField)) {
-            value = field.getValue();
+                value = field.getValue();
+            }
         } catch (IOException e) {
             //swallow
         }
{noformat}




[jira] [Created] (TIKA-1226) PDFTextStripper fails while getting data of PDF form fields of type PDSignatureField

2014-01-23 Thread Eric Knauel (JIRA)
Eric Knauel created TIKA-1226:
-

 Summary: PDFTextStripper fails while getting data of PDF form 
fields of type PDSignatureField
 Key: TIKA-1226
 URL: https://issues.apache.org/jira/browse/TIKA-1226
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Eric Knauel


I have a PDF document that contains a filled-in form. Among the various fields 
of type text and radio button there are multiple fields for digital signatures. 
When I load this document into tika-app, I get the following exception:

Caused by: java.lang.RuntimeException: Can't get signature as String, use getSignature() instead.
at org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField.getValue(PDSignatureField.java:131)
at org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:467)
at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:425)
at org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:411)
at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:184)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:330)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:95)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
... 43 more

The problem seems to be that PDF2XHTML expects that it can call 
getValue() on all PDField objects. According to the PDFBox 1.8.3 Javadoc, this 
is not true for the subclass PDSignatureField:

http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/pdmodel/interactive/form/PDSignatureField.html

The Javadoc says that getSignature() should be called instead.

Assuming that the information inside the signature is not relevant for the 
extraction process and can be discarded, the following patch helps:

Index: tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===
--- tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java  (revision 1560617)
+++ tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java  (revision )
@@ -40,6 +40,7 @@
 import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode;
 import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
 import org.apache.pdfbox.pdmodel.interactive.form.PDField;
+import org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField;
 import org.apache.pdfbox.util.PDFTextStripper;
 import org.apache.pdfbox.util.TextPosition;
 import org.apache.tika.exception.TikaException;
@@ -464,7 +465,9 @@
         }
         String value = "";
         try {
+            if (!(field instanceof PDSignatureField)) {
-            value = field.getValue();
+                value = field.getValue();
+            }
         } catch (IOException e) {
             //swallow
         }





