[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694741#comment-13694741 ]

Nick Burch commented on TIKA-973:
---------------------------------

Patch looks promising to me, but I don't know enough about PDF, so I've not been able to give it a thorough review.

Let's give it a few days before applying, to give others a chance to offer feedback.

One thing that might be good in the unit test is to check for data from each form in turn, so we cover more cases.

PDF form data isn't included in extracted content.
--------------------------------------------------

Key: TIKA-973
URL: https://issues.apache.org/jira/browse/TIKA-973
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 1.2
Reporter: Michael Graessle
Priority: Minor
Attachments: TIKA-973-patch.tar.gz

When extracting content from PDFs, PDF form data isn't extracted. The following code extracts this data via PDFBox, but it seems like something Tika should be doing.

    PDDocumentCatalog docCatalog = load.getDocumentCatalog();
    if (docCatalog != null) {
        PDAcroForm acroForm = docCatalog.getAcroForm();
        if (acroForm != null) {
            @SuppressWarnings("unchecked")
            List<PDField> fields = acroForm.getFields();
            if (fields != null && fields.size() > 0) {
                documentContent.append(" ");
                // Append each non-null field value, separated by spaces
                for (PDField field : fields) {
                    if (field.getValue() != null) {
                        documentContent.append(field.getValue());
                        documentContent.append(" ");
                    }
                }
            }
        }
    }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694774#comment-13694774 ]

Tim Allison commented on TIKA-973:
----------------------------------

Agree on both. Also would appreciate feedback on what the output should be. The current code extracts this unseemly xhtml:

    <div class="acroform">
    <ol>
    <li partialName="form1[0]" fullName="form1[0]"/>
    <ol>
    <li partialName="#subform[6]" fullName="form1[0].#subform[6]"/>
    <li partialName="MiddleInitial[0]" fullName="form1[0].#subform[6].MiddleInitial[0]" altName="Enter Middle Initial (MI)">X</li>
    <li partialName="FamilyName[0]" fullName="form1[0].#subform[6].FamilyName[0]" altName="Section 1. Employee Information and Attestation. Family Name (Last Name)">Doe</li>
    <li partialName="GivenName[0]" fullName="form1[0].#subform[6].GivenName[0]" altName="Given Name (First Name)">John</li>
    <li partialName="OtherNamesUsed[0]" fullName="form1[0].#subform[6].OtherNamesUsed[0]" altName="Maiden Name">Mr. Doe</li>
    <li partialName="StreetNumberName[0]" fullName="form1[0].#subform[6].StreetNumberName[0]" altName=" Street Number and Name">123 Main St.</li>
    ...

Another idea I had was to include the partialName in the contents and not fill out the attrs:

    <li>StreetNumberName[0]: 123 Main St</li>

More unit tests on way...
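In the XHTML quoted in this comment, each fullName reads as the dot-joined chain of partialNames from the root form down to the field. A minimal sketch of that relationship follows, combined with the "partialName in the contents" idea but keyed by full name. FormField and AcroFormWalker are hypothetical stand-ins invented for illustration; they are not PDFBox or Tika classes, and the attached patch may well work differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a PDF form field; not a PDFBox or Tika class. */
final class FormField {
    final String partialName;
    final String value; // null for fields that only group others, e.g. subforms
    final List<FormField> kids = new ArrayList<>();

    FormField(String partialName, String value) {
        this.partialName = partialName;
        this.value = value;
    }

    /** Adds a child field and returns this field so calls can be chained. */
    FormField add(FormField kid) {
        kids.add(kid);
        return this;
    }
}

final class AcroFormWalker {
    /** Collects one "fullName: value" line for every field that carries a value. */
    static List<String> flatten(FormField root) {
        List<String> out = new ArrayList<>();
        walk(root, "", out);
        return out;
    }

    private static void walk(FormField field, String parentName, List<String> out) {
        // The full name is the parent's full name plus this field's partial name.
        String fullName = parentName.isEmpty()
                ? field.partialName
                : parentName + "." + field.partialName;
        if (field.value != null) {
            out.add(fullName + ": " + field.value);
        }
        // Recurse depth-first so nested subforms are visited in document order.
        for (FormField kid : field.kids) {
            walk(kid, fullName, out);
        }
    }
}
```

Building a tree of form1[0] containing #subform[6] with GivenName[0] and FamilyName[0] underneath, then calling AcroFormWalker.flatten(root), yields one line per filled-in field, each prefixed with its full dotted name.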
[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694782#comment-13694782 ]

Nick Burch commented on TIKA-973:
---------------------------------

To make reviewing easier, it might be handy if you could upload a PNG screenshot of one of these forms, so it's quick to view that alongside the suggested html.

I'd be minded to go for something like:

    <li title="Street Number and Name">StreetNumberName[0]: 123 Main St</li>

So we'd have the alt name, the partial name, and the value, but not the full name (but we would have the form/subform name elsewhere).
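As a rough illustration of the list item format suggested in this comment, the sketch below builds one such element from an alt name, a partial name and a value. AcroFormLi and its methods are hypothetical names made up here; the sketch uses plain string building with manual escaping and is not how Tika itself emits XHTML.

```java
/** Hypothetical helper for illustration only; not part of Tika or PDFBox. */
final class AcroFormLi {
    /** Escapes the characters that are unsafe in XHTML text and attribute values. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Renders one field as li, with the alt name as title and "partialName: value" as text. */
    static String render(String altName, String partialName, String value) {
        StringBuilder sb = new StringBuilder("<li");
        // Omit the title attribute when the field has no usable alt name.
        if (altName != null && !altName.trim().isEmpty()) {
            sb.append(" title=\"").append(escape(altName.trim())).append('"');
        }
        sb.append('>')
          .append(escape(partialName))
          .append(": ")
          .append(escape(value == null ? "" : value))
          .append("</li>");
        return sb.toString();
    }
}
```

The alt name is trimmed because the sample output earlier in the thread shows one with a leading space. Whether an empty alt name should drop the title attribute, as assumed here, is a design choice the thread leaves open.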
[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693068#comment-13693068 ]

Tim Allison commented on TIKA-973:
----------------------------------

Will submit patch and tests by end of the week.