[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-27 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694741#comment-13694741
 ] 

Nick Burch commented on TIKA-973:
-

The patch looks promising to me, but I don't know enough about PDF to give it a 
thorough review.

Let's give it a few days before applying, so others have a chance to offer 
feedback.

One thing that might be good in the unit test is to check for data from each 
form in turn, so we cover more cases.
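A minimal sketch of the "check each form in turn" idea, assuming the test already has the extracted text as a string. FormValueCheck and missingFields are hypothetical names for illustration and are not part of Tika or the attached patch; the field names and values are taken from the sample output quoted later in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical test helper; not part of Tika or the attached patch.
final class FormValueCheck {
    private FormValueCheck() {}

    /**
     * Returns the names of all fields whose expected value does not occur
     * anywhere in the extracted text, in the iteration order of the map.
     */
    static List<String> missingFields(String extractedText,
                                      Map<String, String> expectedByField) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, String> entry : expectedByField.entrySet()) {
            if (!extractedText.contains(entry.getValue())) {
                missing.add(entry.getKey());
            }
        }
        return missing;
    }
}
```

A test could then assert that the returned list is empty, which reports every missing field at once instead of stopping at the first failure.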

 PDF form data isn't included in extracted content.
 --

 Key: TIKA-973
 URL: https://issues.apache.org/jira/browse/TIKA-973
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.2
Reporter: Michael Graessle
Priority: Minor
 Attachments: TIKA-973-patch.tar.gz


 When extracting content from PDFs, PDF form data isn't extracted. 
 The following code extracts this data via PDFBox, but it seems like 
 something Tika should be doing.
 // load: the loaded PDDocument; documentContent: the text buffer being built
 PDDocumentCatalog docCatalog = load.getDocumentCatalog();
 if (docCatalog != null) {
   PDAcroForm acroForm = docCatalog.getAcroForm();
   if (acroForm != null) {
     @SuppressWarnings("unchecked")
     List<PDField> fields = acroForm.getFields();
     if (fields != null && fields.size() > 0) {
       documentContent.append(" ");
       for (PDField field : fields) {
         // skip fields that have no value set
         if (field.getValue() != null) {
           documentContent.append(field.getValue());
           documentContent.append(" ");
         }
       }
     }
   }
 }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-27 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694774#comment-13694774
 ] 

Tim Allison commented on TIKA-973:
--

Agree on both.  Also would appreciate feedback on what the output should be.  
The current code extracts this unseemly xhtml:

<div class="acroform">
 <ol>
  <li partialName="form1[0]" fullName="form1[0]"/>
  <ol>
   <li partialName="#subform[6]" fullName="form1[0].#subform[6]"/>
   <li partialName="MiddleInitial[0]"
       fullName="form1[0].#subform[6].MiddleInitial[0]"
       altName="Enter Middle Initial (MI)">X</li>
   <li partialName="FamilyName[0]"
       fullName="form1[0].#subform[6].FamilyName[0]"
       altName="Section 1. Employee Information and Attestation.  Family Name (Last Name)">Doe</li>
   <li partialName="GivenName[0]"
       fullName="form1[0].#subform[6].GivenName[0]"
       altName="Given Name (First Name)">John</li>
   <li partialName="OtherNamesUsed[0]"
       fullName="form1[0].#subform[6].OtherNamesUsed[0]"
       altName="Maiden Name">Mr. Doe</li>
   <li partialName="StreetNumberName[0]"
       fullName="form1[0].#subform[6].StreetNumberName[0]"
       altName=" Street Number and Name">123 Main St.</li>


...

Another idea I had was to include the partialName in the contents and not fill 
out the attrs:
<li>StreetNumberName[0]: 123 Main St</li>

More unit tests on the way...
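In the output above, each fullName looks like the partial names of the enclosing form and subform joined with dots. A rough sketch of that relationship, using a hypothetical FieldNode class as a stand-in for a form field (it is not a PDFBox or Tika type, and the real patch may walk the fields differently):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a form field; not a PDFBox or Tika class.
final class FieldNode {
    final String partialName;
    final String value; // null for containers such as forms and subforms
    final List<FieldNode> kids = new ArrayList<>();

    FieldNode(String partialName, String value) {
        this.partialName = partialName;
        this.value = value;
    }

    /** Adds a child field and returns this node so calls can be chained. */
    FieldNode add(FieldNode kid) {
        kids.add(kid);
        return this;
    }

    /** Collects "fullName: value" for every field that has a value, depth first. */
    static List<String> flatten(FieldNode root) {
        List<String> out = new ArrayList<>();
        walk(root, "", out);
        return out;
    }

    private static void walk(FieldNode node, String parentName, List<String> out) {
        // The full name is the parent's full name plus this node's partial name.
        String fullName = parentName.isEmpty()
                ? node.partialName
                : parentName + "." + node.partialName;
        if (node.value != null) {
            out.add(fullName + ": " + node.value);
        }
        for (FieldNode kid : node.kids) {
            walk(kid, fullName, out);
        }
    }
}
```

Building form1[0] with a #subform[6] child that holds GivenName[0] and FamilyName[0] would give one line per filled-in field, each prefixed with form1[0].#subform[6].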



[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-27 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694782#comment-13694782
 ] 

Nick Burch commented on TIKA-973:
-

To make reviewing easier, it might be handy if you could upload a PNG 
screenshot of one of these forms, so it's quick to view that alongside the 
suggested HTML.

I'd be minded to go for something like:
  <li title="Street Number and Name">StreetNumberName[0]: 123 Main St</li>

So we'd have the alt name, the partial name and the value, but not the full name 
(though we would have the form/subform name elsewhere).
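A small sketch of how that list item could be assembled, assuming plain string building for illustration. Tika itself emits XHTML through SAX events, so AcroFormMarkup, escape and listItem are hypothetical names and not the patch's code.

```java
// Hypothetical helper for illustration; Tika itself emits XHTML through SAX events.
final class AcroFormMarkup {
    private AcroFormMarkup() {}

    /** Escapes the characters that are unsafe in XHTML text and attribute values. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Builds one list item: the alt name (if any) goes into the title attribute,
     * the partial name and value go into the element content.
     */
    static String listItem(String altName, String partialName, String value) {
        StringBuilder sb = new StringBuilder("<li");
        if (altName != null && !altName.trim().isEmpty()) {
            sb.append(" title=\"").append(escape(altName.trim())).append('"');
        }
        sb.append('>')
          .append(escape(partialName))
          .append(": ")
          .append(escape(value))
          .append("</li>");
        return sb.toString();
    }
}
```

Passing null for the alt name drops the title attribute, which gives the simpler "partialName: value" form suggested earlier in the thread.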



[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.

2013-06-25 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693068#comment-13693068
 ] 

Tim Allison commented on TIKA-973:
--

Will submit patch and tests by end of the week.
