[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-973:
----------------------------

    Attachment: TIKA-973.patch.tar.gz

Middle-road change made. The alternate name is emitted as an attribute, and the partial name is added to the content, followed by a colon. I also added a few more tests.

PDF form data isn't included in extracted content.
--------------------------------------------------

    Key: TIKA-973
    URL: https://issues.apache.org/jira/browse/TIKA-973
    Project: Tika
    Issue Type: Bug
    Components: general
    Affects Versions: 1.2
    Reporter: Michael Graessle
    Priority: Minor
    Attachments: i-9_screenshot.png, TIKA-973-patch.tar.gz, TIKA-973.patch.tar.gz

When extracting content from PDFs, PDF form data isn't extracted. The following code extracts this data via PDFBox, but it seems like something Tika should be doing.

    PDDocumentCatalog docCatalog = load.getDocumentCatalog();
    if (docCatalog != null) {
        PDAcroForm acroForm = docCatalog.getAcroForm();
        if (acroForm != null) {
            @SuppressWarnings("unchecked")
            List<PDField> fields = acroForm.getFields();
            if (fields != null && fields.size() > 0) {
                documentContent.append(" ");
                for (PDField field : fields) {
                    if (field.getValue() != null) {
                        documentContent.append(field.getValue());
                        documentContent.append(" ");
                    }
                }
            }
        }
    }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-973:
----------------------------

    Attachment: i-9_screenshot.png

Screenshot attached. Thanks again to http://benlitchfield.sys-con.com/node/48543?page=0,1 for the code example and example document. The middle ground that you recommend makes sense.
[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-973:
----------------------------

    Attachment: TIKA-973-patch.tar.gz

Patch attached. It dumps the contents of PDF forms at the end of the document. AcroForm field name metadata is in attribute values. Basic format is ol. Let me know how this looks.

Thank you, Ben Litchfield, for org.apache.pdfbox.examples.fdf.PrintFields.
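
For readers who have not opened the patch, the output shape described in these updates (form fields emitted as an ordered list, the alternate name carried in an attribute, the partial name written into the content followed by a colon) might look roughly like the sketch below. This is not the code from the attached patch. The `Field` class is a hypothetical stand-in for a PDFBox form field so that the example runs without PDFBox on the classpath, and the `altName` attribute name, the `render` method and the string-based markup are all assumptions made for illustration. Partial names are assumed to be non-null.

```java
import java.util.Arrays;
import java.util.List;

public class AcroFormSketch {

    /** Hypothetical stand-in for a PDFBox form field; not a PDFBox class. */
    static final class Field {
        final String partialName;   // assumed non-null in this sketch
        final String alternateName; // may be null
        final String value;         // may be null
        final List<Field> kids;     // child fields, possibly empty

        Field(String partialName, String alternateName, String value, Field... kids) {
            this.partialName = partialName;
            this.alternateName = alternateName;
            this.value = value;
            this.kids = Arrays.asList(kids);
        }
    }

    /** Renders the field tree as nested ordered lists. */
    static String render(List<Field> fields) {
        StringBuilder out = new StringBuilder();
        renderList(fields, out);
        return out.toString();
    }

    private static void renderList(List<Field> fields, StringBuilder out) {
        if (fields == null || fields.isEmpty()) {
            return; // nothing to emit for an empty level
        }
        out.append("<ol>");
        for (Field f : fields) {
            out.append("<li");
            if (f.alternateName != null) {
                // Alternate name goes into an attribute (attribute name is assumed).
                out.append(" altName=\"").append(escape(f.alternateName)).append('"');
            }
            out.append('>');
            // Partial name goes into the content, followed by a colon.
            out.append(escape(f.partialName)).append(": ");
            if (f.value != null) {
                out.append(escape(f.value));
            }
            renderList(f.kids, out); // recurse into child fields
            out.append("</li>");
        }
        out.append("</ol>");
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        List<Field> fields = Arrays.asList(
                new Field("name", "Full legal name", "Jane Doe"),
                new Field("address", null, null,
                        new Field("city", "City or town", "Springfield")));
        System.out.println(render(fields));
    }
}
```

Unlike the loop in the issue description, which only visits top-level fields, this sketch recurses into child fields, so values nested under a parent field are not dropped.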