[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694774#comment-13694774 ]
Tim Allison commented on TIKA-973:
----------------------------------

Agree on both. Also would appreciate feedback on what the output should be. The current code extracts this unseemly xhtml:

<div class="acroform">
<ol>
<li partialName="form1[0]" fullName="form1[0]"/>
<ol>
<li partialName="#subform[6]" fullName="form1[0].#subform[6]"/>
<li partialName="MiddleInitial[0]" fullName="form1[0].#subform[6].MiddleInitial[0]" altName="Enter Middle Initial (MI)">X</li>
<li partialName="FamilyName[0]" fullName="form1[0].#subform[6].FamilyName[0]" altName="Section 1. Employee Information and Attestation. Family Name (Last Name)">Doe</li>
<li partialName="GivenName[0]" fullName="form1[0].#subform[6].GivenName[0]" altName="Given Name (First Name)">John</li>
<li partialName="OtherNamesUsed[0]" fullName="form1[0].#subform[6].OtherNamesUsed[0]" altName="Maiden Name">Mr. Doe</li>
<li partialName="StreetNumberName[0]" fullName="form1[0].#subform[6].StreetNumberName[0]" altName=" Street Number and Name">123 Main St.</li>
...

Another idea I had was to include the partialName in the contents and not fill out the attrs:

<li>StreetNumberName[0]: 123 Main St</li>

More unit tests on way...

> PDF form data isn't included in extracted content.
> --------------------------------------------------
>
>                 Key: TIKA-973
>                 URL: https://issues.apache.org/jira/browse/TIKA-973
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.2
>            Reporter: Michael Graessle
>            Priority: Minor
>         Attachments: TIKA-973-patch.tar.gz
>
>
> When extracting content from PDFs, PDF form data isn't extracted.
> The following code extracts this data via PDF box, but it seems like
> something Tika should be doing.
> PDDocumentCatalog docCatalog = load.getDocumentCatalog();
> if (docCatalog != null) {
>   PDAcroForm acroForm = docCatalog.getAcroForm();
>   if (acroForm != null) {
>     @SuppressWarnings("unchecked")
>     List<PDField> fields = acroForm.getFields();
>     if (fields != null && fields.size() > 0) {
>       documentContent.append(" ");
>       for (PDField field : fields) {
>         if (field.getValue()!=null) {
>           documentContent.append(field.getValue());
>           documentContent.append(" ");
>         }
>       }
>     }
>   }
> }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
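
For illustration only, the alternative output floated in the comment (putting the partialName in the list item's text instead of in attributes) might look roughly like the sketch below. It does not use PDFBox or Tika: `Field`, `escape`, and `render` are hypothetical names invented here, and the nesting of child lists inside their parent `<li>` is an assumption about what well-formed output would be, not what the current patch emits.

```java
import java.util.Arrays;
import java.util.List;

public class AcroFormListSketch {

    /** Hypothetical stand-in for a form field; this is NOT a PDFBox type. */
    static final class Field {
        final String partialName;
        final String value;        // null for container fields such as subforms
        final List<Field> kids;

        Field(String partialName, String value, Field... kids) {
            this.partialName = partialName;
            this.value = value;
            this.kids = Arrays.asList(kids);
        }
    }

    /** Escapes the characters that are unsafe in XHTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders fields as a nested list, one "partialName: value" item per field. */
    static String render(List<Field> fields) {
        StringBuilder sb = new StringBuilder("<ol>");
        for (Field f : fields) {
            sb.append("<li>").append(escape(f.partialName));
            if (f.value != null) {
                sb.append(": ").append(escape(f.value));
            }
            if (!f.kids.isEmpty()) {
                // Assumption: the child list is kept inside its parent <li>.
                sb.append(render(f.kids));
            }
            sb.append("</li>");
        }
        return sb.append("</ol>").toString();
    }

    public static void main(String[] args) {
        // Sample data taken from the field names shown in the comment.
        Field form = new Field("form1[0]", null,
            new Field("#subform[6]", null,
                new Field("GivenName[0]", "John"),
                new Field("StreetNumberName[0]", "123 Main St.")));
        System.out.println(render(Arrays.asList(form)));
    }
}
```

A real implementation would walk the PDFBox field tree and write through Tika's XHTML handler rather than building a string; the sketch only shows the shape of the proposed output.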