[jira] [Commented] (TIKA-1109) Metadata not extracted before the context in OOXML (pptx)
[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694521#comment-13694521 ]

Daniel Bonniot de Ruisselet commented on TIKA-1109:
---------------------------------------------------

Nick, thanks a lot for your explanation. If I understand correctly, you are saying that in general it cannot be guaranteed that the metadata is available during parsing, since whether that is possible depends on the format. That makes complete sense.

Here I am asking specifically about the OOXML formats, with an example pptx file. As I understand it, OOXML files are zip files containing XML files. In test-classes/test-documents/testPPT.pptx, the metadata seems to be inside docProps/core.xml. Would it be possible to read the metadata from there first, before starting the parsing?

Metadata not extracted before the context in OOXML (pptx)
---------------------------------------------------------

    Key: TIKA-1109
    URL: https://issues.apache.org/jira/browse/TIKA-1109
    Project: Tika
    Issue Type: Bug
    Components: parser
    Reporter: Daniel Bonniot de Ruisselet
    Priority: Critical
    Fix For: 1.5

It seems that when processing OOXML documents, the metadata is only read after the text. This means it's impossible to use the metadata while processing the text. I think it would be more useful to have the metadata populated first.

As a symptom:
    java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
outputs only as metadata:
    <meta name="Content-Length" content="36518"/>
    <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
    <meta name="resourceName" content="testPPT.pptx"/>
while there is more metadata in the file (e.g. <dc:title>Attachment Test</dc:title>).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
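The idea raised in the comment above (reading docProps/core.xml out of the zip before parsing the content) can be sketched with the JDK alone. This is only an illustration, not Tika's or POI's implementation: the class name CorePropertiesPeek and the method readTitle are hypothetical, and the sketch assumes the core properties part sits at the conventional path docProps/core.xml, whereas a robust reader would resolve it through the package relationships.

```java
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not part of Tika): peeks at docProps/core.xml in an OOXML package. */
final class CorePropertiesPeek {
    // Dublin Core namespace, which the dc:title element in core.xml belongs to.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns the text of the first dc:title, or null if the part or the element is missing. */
    static String readTitle(String ooxmlPath) throws Exception {
        try (ZipFile zip = new ZipFile(ooxmlPath)) {
            // Assumption: the part lives at its conventional location.
            ZipEntry core = zip.getEntry("docProps/core.xml");
            if (core == null) {
                return null;
            }
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            // Refuse DOCTYPE declarations so untrusted files cannot trigger entity expansion.
            dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            try (InputStream in = zip.getInputStream(core)) {
                Document doc = dbf.newDocumentBuilder().parse(in);
                NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
                return titles.getLength() == 0 ? null : titles.item(0).getTextContent();
            }
        }
    }
}
```

Because a zip file has a central directory, the metadata part can be opened directly without streaming through the slide or sheet parts first, which is what makes a "metadata first" ordering feasible for this family of formats.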
[jira] [Commented] (TIKA-1109) Metadata not extracted before the context in OOXML (pptx)
[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694612#comment-13694612 ]

Nick Burch commented on TIKA-1109:
----------------------------------

For OOXML files, the metadata is mostly in a few different XML files within the zip. For Excel, there are also a few bits stored in the main spreadsheet / sheet stream too...

Not sure if it would break things if we did most of the metadata fetching first. Could you try moving the metadata line up, and see if the unit tests all still pass?
[jira] [Commented] (TIKA-1053) Upgrade Tika Parsers to use ASM 4.x
[ https://issues.apache.org/jira/browse/TIKA-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694624#comment-13694624 ]

Vincent Massol commented on TIKA-1053:
--------------------------------------

Thanks again for fixing this. Any idea when Tika 1.4 is going to be released? (I'm still waiting for this fix).

Upgrade Tika Parsers to use ASM 4.x
-----------------------------------

    Key: TIKA-1053
    URL: https://issues.apache.org/jira/browse/TIKA-1053
    Project: Tika
    Issue Type: Improvement
    Components: parser
    Affects Versions: 1.2
    Reporter: Vincent Massol
    Assignee: Michael McCandless
    Fix For: 1.4
    Attachments: TIKA-1053.patch

Right now Tika 1.2 uses ASM 3.1. However, this is causing some issues for us on the XWiki project, since we also bundle other frameworks that use a more recent version of ASM (we use pegdown, which uses parboiled, which draws in ASM 4.0). The problem is that ASM 3.x and 4.0 are not compatible...

See http://jira.xwiki.org/browse/XE-1269 for more details about the issue we're facing.

Thanks for considering upgrading to ASM 4.x :)
[jira] [Updated] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)
[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Bonniot de Ruisselet updated TIKA-1109:
----------------------------------------------

    Summary: Metadata not extracted before the content in OOXML (pptx)  (was: Metadata not extracted before the context in OOXML (pptx))
[jira] [Commented] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)
[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694659#comment-13694659 ]

Daniel Bonniot de Ruisselet commented on TIKA-1109:
---------------------------------------------------

I tried it. It broke two tests (same cause): as you mentioned, in Excel the metadata for TikaMetadataKeys.PROTECTED is populated during parsing. I made a change in how that is implemented, and:

    [INFO]
    [INFO] Building Apache Tika 1.5-SNAPSHOT
    [INFO]
    [INFO]
    [INFO] --- maven-remote-resources-plugin:1.2.1:process (default) @ tika ---
    [INFO]
    [INFO] Reactor Summary:
    [INFO]
    [INFO] Apache Tika parent SUCCESS [0.806s]
    [INFO] Apache Tika core .. SUCCESS [8.418s]
    [INFO] Apache Tika parsers ... SUCCESS [26.857s]
    [INFO] Apache Tika XMP ... SUCCESS [0.789s]
    [INFO] Apache Tika application ... SUCCESS [3.336s]
    [INFO] Apache Tika OSGi bundle ... SUCCESS [1.204s]
    [INFO] Apache Tika server SUCCESS [5.312s]
    [INFO] Apache Tika ... SUCCESS [0.014s]
    [INFO]
    [INFO] BUILD SUCCESS
    [INFO]
    [INFO] Total time: 47.498s
    [INFO] Finished at: Thu Jun 27 14:10:50 CEST 2013
    [INFO] Final Memory: 27M/1930M
    [INFO]

    dbonniot@naming:~/world/tika$ svn diff | diffstat
     main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java | 11 -
     main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java | 36 ++
     test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java | 56 ++
     3 files changed, 74 insertions(+), 29 deletions(-)
    dbonniot@naming:~/world/tika$ svn diff > /tmp/TIKA-1109.patch

The logic in OOXMLExtractorFactory is now simpler, since I could remove the extra shielding, which I suppose was made necessary by the previous ordering.

And the metadata for OOXML formats is now available at parse time, as tested by the added test to OOXMLParserTest :)
[jira] [Comment Edited] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)
[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694659#comment-13694659 ]

Daniel Bonniot de Ruisselet edited comment on TIKA-1109 at 6/27/13 12:27 PM:
-----------------------------------------------------------------------------

I tried it. It broke two tests (same cause): as you mentioned, in Excel the metadata for TikaMetadataKeys.PROTECTED is populated during parsing. I made a change in how that is implemented, and:

    [INFO]
    [INFO] Building Apache Tika 1.5-SNAPSHOT
    [INFO]
    [INFO]
    [INFO] --- maven-remote-resources-plugin:1.2.1:process (default) @ tika ---
    [INFO]
    [INFO] Reactor Summary:
    [INFO]
    [INFO] Apache Tika parent SUCCESS [0.806s]
    [INFO] Apache Tika core .. SUCCESS [8.418s]
    [INFO] Apache Tika parsers ... SUCCESS [26.857s]
    [INFO] Apache Tika XMP ... SUCCESS [0.789s]
    [INFO] Apache Tika application ... SUCCESS [3.336s]
    [INFO] Apache Tika OSGi bundle ... SUCCESS [1.204s]
    [INFO] Apache Tika server SUCCESS [5.312s]
    [INFO] Apache Tika ... SUCCESS [0.014s]
    [INFO]
    [INFO] BUILD SUCCESS
    [INFO]
    [INFO] Total time: 47.498s
    [INFO] Finished at: Thu Jun 27 14:10:50 CEST 2013
    [INFO] Final Memory: 27M/1930M
    [INFO]

    dbonniot@naming:~/world/tika$ svn diff | diffstat
    main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java | 11 -
    main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java | 36 ++
    test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java | 56 ++
    3 files changed, 74 insertions(+), 29 deletions(-)
    dbonniot@naming:~/world/tika$ svn diff > /tmp/TIKA-1109.patch

The logic in OOXMLExtractorFactory is now simpler, since I could remove the extra shielding, which I suppose was made necessary by the previous ordering.

And the metadata for OOXML formats is now available at parse time, as tested by the added test to OOXMLParserTest :)
[jira] [Updated] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)
[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Bonniot de Ruisselet updated TIKA-1109:
----------------------------------------------

    Attachment: TIKA-1109.patch
[jira] [Resolved] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)
[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-1109.
------------------------------

    Resolution: Fixed

Thanks for the patch, making life easier for downstream users AND tidying the code at the same time is always a win :)

Applied (with a slight comment tweak) in r1497332.

Metadata not extracted before the content in OOXML (pptx)
---------------------------------------------------------

    Key: TIKA-1109
    URL: https://issues.apache.org/jira/browse/TIKA-1109
    Project: Tika
    Issue Type: Bug
    Components: parser
    Reporter: Daniel Bonniot de Ruisselet
    Priority: Critical
    Labels: patch
    Fix For: 1.5
    Attachments: TIKA-1109.patch
[jira] [Commented] (TIKA-1053) Upgrade Tika Parsers to use ASM 4.x
[ https://issues.apache.org/jira/browse/TIKA-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694677#comment-13694677 ]

Vincent Massol commented on TIKA-1053:
--------------------------------------

Cool, thanks Nick.
[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694741#comment-13694741 ]

Nick Burch commented on TIKA-973:
---------------------------------

Patch looks promising to me, but I don't know enough about PDF, so I've not been able to give it a thorough review.

Let's give it a few days before applying, to give others a chance to offer feedback.

One thing that might be good in the unit test is to check for data from each form in turn, so we cover more cases.

PDF form data isn't included in extracted content.
--------------------------------------------------

    Key: TIKA-973
    URL: https://issues.apache.org/jira/browse/TIKA-973
    Project: Tika
    Issue Type: Bug
    Components: general
    Affects Versions: 1.2
    Reporter: Michael Graessle
    Priority: Minor
    Attachments: TIKA-973-patch.tar.gz

When extracting content from PDFs, PDF form data isn't extracted. The following code extracts this data via PDFBox, but it seems like something Tika should be doing.

    PDDocumentCatalog docCatalog = load.getDocumentCatalog();
    if (docCatalog != null) {
        PDAcroForm acroForm = docCatalog.getAcroForm();
        if (acroForm != null) {
            @SuppressWarnings("unchecked")
            List<PDField> fields = acroForm.getFields();
            if (fields != null && fields.size() > 0) {
                documentContent.append(" ");
                for (PDField field : fields) {
                    if (field.getValue() != null) {
                        documentContent.append(field.getValue());
                        documentContent.append(" ");
                    }
                }
            }
        }
    }
[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694774#comment-13694774 ]

Tim Allison commented on TIKA-973:
----------------------------------

Agree on both. Also would appreciate feedback on what the output should be. The current code extracts this unseemly xhtml:

    <div class="acroform">
    <ol>
    <li partialName="form1[0]" fullName="form1[0]"/>
    <ol>
    <li partialName="#subform[6]" fullName="form1[0].#subform[6]"/>
    <li partialName="MiddleInitial[0]" fullName="form1[0].#subform[6].MiddleInitial[0]" altName="Enter Middle Initial (MI)">X</li>
    <li partialName="FamilyName[0]" fullName="form1[0].#subform[6].FamilyName[0]" altName="Section 1. Employee Information and Attestation. Family Name (Last Name)">Doe</li>
    <li partialName="GivenName[0]" fullName="form1[0].#subform[6].GivenName[0]" altName="Given Name (First Name)">John</li>
    <li partialName="OtherNamesUsed[0]" fullName="form1[0].#subform[6].OtherNamesUsed[0]" altName="Maiden Name">Mr. Doe</li>
    <li partialName="StreetNumberName[0]" fullName="form1[0].#subform[6].StreetNumberName[0]" altName=" Street Number and Name">123 Main St.</li>
    ...

Another idea I had was to include the partialName in the contents and not fill out the attrs:

    <li>StreetNumberName[0]: 123 Main St</li>

More unit tests on way...
[jira] [Commented] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694782#comment-13694782 ]

Nick Burch commented on TIKA-973:
---------------------------------

To make reviewing easier, it might be handy if you could upload a PNG screenshot of one of these forms, so it's quick to view that alongside the suggested html.

I'd be minded to go for something like:

    <li title="Street Number and Name">StreetNumberName[0]: 123 Main St</li>

So we'd have the alt name, the partial name, the value, but not the full name (but we would have the form/subform name elsewhere).
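The list-item shape suggested above (alt name as the title attribute, partial name and value as the text, full name omitted) can be sketched as a small string renderer. This is a standalone illustration, not the code from the attached patch: the class AcroFormItemSketch and its methods are hypothetical, it builds a plain string rather than emitting SAX events as a Tika parser would, and trimming the alt name is an assumption made here because the sample altName in the thread carries a leading space.

```java
/** Hypothetical renderer for one form field; not Tika's actual PDF parser code. */
final class AcroFormItemSketch {

    /** Escapes the characters that are unsafe in XHTML text and attribute values. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Renders one field as: <li title="altName">partialName: value</li>.
     * The title attribute is omitted when there is no alt name; a null value renders as empty text.
     */
    static String renderField(String altName, String partialName, String value) {
        StringBuilder li = new StringBuilder("<li");
        if (altName != null && !altName.trim().isEmpty()) {
            // Assumption: surrounding whitespace in the alt name is not meaningful.
            li.append(" title=\"").append(escape(altName.trim())).append('"');
        }
        li.append('>')
          .append(escape(partialName))
          .append(": ")
          .append(escape(value == null ? "" : value))
          .append("</li>");
        return li.toString();
    }
}
```

Calling renderField(" Street Number and Name", "StreetNumberName[0]", "123 Main St") yields the list item shown in the comment. Keeping the partial name in the text rather than in an attribute means it survives plain-text extraction, which is the trade-off the thread is weighing.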
[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-973:
----------------------------

    Attachment: i-9_screenshot.png

Screenshot attached. Thanks again to http://benlitchfield.sys-con.com/node/48543?page=0,1 for the code example and example doc.

The middle ground that you recommend makes sense.