[ https://issues.apache.org/jira/browse/SOLR-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871743#comment-16871743 ]
Nicolas Larcipretti commented on SOLR-7189:
-------------------------------------------

Does this solution work with PDFs as well? I was able to extract text from images (png) and images within docx, but could not with images within PDFs.

Here's some info:
Solr version: 8.1.1
Tesseract version: 3.04.01
PDF parser: "X-Parsed-By", [ "org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.pdf.PDFParser" ]

> Allow DIH to extract content from embedded documents via Tika
> -------------------------------------------------------------
>
>                 Key: SOLR-7189
>                 URL: https://issues.apache.org/jira/browse/SOLR-7189
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler
>    Affects Versions: 5.0
>            Reporter: Tim Allison
>            Assignee: Shalin Shekhar Mangar
>            Priority: Minor
>             Fix For: 5.1, 6.0
>
>         Attachments: SOLR-7189.patch, test_recursive_embedded.docx
>
>
> DIH's TikaEntityProcessor doesn't currently extract content from embedded
> documents/attachments within a file. It might be useful if users could
> configure whether or not to include extraction of content from embedded
> documents.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)