[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.

2013-07-02 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-973:
-

Attachment: TIKA-973.patch.tar.gz

Middle-road change made.  The alternate name is now an attribute, and the 
partial name is added to the content, followed by a colon.
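
As a rough illustration of the format described above (not the code from the 
patch): the sketch below renders one form field so that the alternate name 
travels as an attribute and the partial name is written into the content, 
followed by a colon. The class name FieldItemSketch, the li element, and the 
altName attribute are assumptions made for this example; the names actually 
used are in the attached patch.

```java
// Hypothetical sketch: element and attribute names are assumptions,
// not taken from the TIKA-973 patch.
public class FieldItemSketch {

    /** Escapes the characters that are unsafe in XHTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /**
     * Renders one field: the alternate name as an attribute, the partial
     * name in the content followed by a colon, then the field value.
     */
    static String renderItem(String partialName, String alternateName, String value) {
        StringBuilder sb = new StringBuilder("<li");
        if (alternateName != null) {
            sb.append(" altName=\"").append(escape(alternateName)).append('"');
        }
        sb.append('>');
        if (partialName != null) {
            sb.append(escape(partialName)).append(": ");
        }
        if (value != null) {
            sb.append(escape(value));
        }
        return sb.append("</li>").toString();
    }

    public static void main(String[] args) {
        // Prints: <li altName="Last Name">lastName: Smith</li>
        System.out.println(renderItem("lastName", "Last Name", "Smith"));
    }
}
```

Keeping the partial name in the content means it stays visible to plain-text 
consumers, while the longer alternate name stays out of the text stream.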

I also added a few more tests.

 PDF form data isn't included in extracted content.
 --

 Key: TIKA-973
 URL: https://issues.apache.org/jira/browse/TIKA-973
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.2
Reporter: Michael Graessle
Priority: Minor
 Attachments: i-9_screenshot.png, TIKA-973-patch.tar.gz, 
 TIKA-973.patch.tar.gz


 When extracting content from PDFs, PDF form data isn't extracted. 
 The following code extracts this data via PDFBox, but it seems like 
 something Tika should be doing.
 PDDocumentCatalog docCatalog = load.getDocumentCatalog();
 if (docCatalog != null) {
   PDAcroForm acroForm = docCatalog.getAcroForm();
   if (acroForm != null) {
     @SuppressWarnings("unchecked")
     List<PDField> fields = acroForm.getFields();
     if (fields != null && fields.size() > 0) {
       documentContent.append(" ");
       for (PDField field : fields) {
         if (field.getValue() != null) {
           documentContent.append(field.getValue());
           documentContent.append(" ");
         }
       }
     }
   }
 }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-27 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-973:
-

Attachment: i-9_screenshot.png

Screenshot attached.  Thanks again to: 
http://benlitchfield.sys-con.com/node/48543?page=0,1 for the code example and 
example doc.

The middle ground that you recommend makes sense.



 PDF form data isn't included in extracted content.
 --

 Key: TIKA-973
 URL: https://issues.apache.org/jira/browse/TIKA-973
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.2
Reporter: Michael Graessle
Priority: Minor
 Attachments: i-9_screenshot.png, TIKA-973-patch.tar.gz





[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-26 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-973:
-

Attachment: TIKA-973-patch.tar.gz

Patch attached.  It dumps the contents of PDF forms at the end of the document.  

AcroForm field name metadata is in attribute values.  The basic format is an <ol> list.
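
The approach in this comment can be sketched with the JDK's SAX classes. This 
is only an illustration under assumptions, not the patch itself: the acroform 
class value, the fieldName attribute, and the AcroFormDumpSketch class are 
hypothetical, and a plain map stands in for the fields PDFBox would return. A 
caller would invoke dumpFields after the page content has been written, so 
the list lands at the end of the document.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: names below are assumptions, not taken from the patch.
public class AcroFormDumpSketch {

    /** Emits field values as an ordered list; each field name goes into an attribute. */
    static void dumpFields(Map<String, String> fields, ContentHandler handler)
            throws SAXException {
        if (fields == null || fields.isEmpty()) {
            return; // no form data, so nothing is appended
        }
        AttributesImpl divAttrs = new AttributesImpl();
        divAttrs.addAttribute("", "class", "class", "CDATA", "acroform");
        handler.startElement("", "div", "div", divAttrs);
        handler.startElement("", "ol", "ol", new AttributesImpl());
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (e.getValue() == null) {
                continue; // skip fields without a value, as in the issue's snippet
            }
            AttributesImpl liAttrs = new AttributesImpl();
            liAttrs.addAttribute("", "fieldName", "fieldName", "CDATA", e.getKey());
            handler.startElement("", "li", "li", liAttrs);
            char[] text = e.getValue().toCharArray();
            handler.characters(text, 0, text.length);
            handler.endElement("", "li", "li");
        }
        handler.endElement("", "ol", "ol");
        handler.endElement("", "div", "div");
    }

    /** Minimal handler that records SAX events as markup so the output can be inspected. */
    static class RecordingHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            out.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                out.append(' ').append(atts.getQName(i))
                   .append("=\"").append(atts.getValue(i)).append('"');
            }
            out.append('>');
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Convenience wrapper that returns the recorded markup as a string. */
    static String render(Map<String, String> fields) {
        RecordingHandler handler = new RecordingHandler();
        try {
            dumpFields(fields, handler);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("lastName", "Smith");
        fields.put("middleInitial", null);
        System.out.println(render(fields));
    }
}
```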

Let me know how this looks.

Thank you, Ben Litchfield, for org.apache.pdfbox.examples.fdf.PrintFields.


 PDF form data isn't included in extracted content.
 --

 Key: TIKA-973
 URL: https://issues.apache.org/jira/browse/TIKA-973
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.2
Reporter: Michael Graessle
Priority: Minor
 Attachments: TIKA-973-patch.tar.gz


