[ https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892957#action_12892957 ]
Tommaso Teofili commented on SOLR-1902:
---------------------------------------

Hi all,

I had the same issue David has, so I applied the patch (modifying the files one by one) to a fresh Solr 1.4.1 checkout, and most of my PDFs are now indexed with their text extracted (using the "example" Solr instance).

In the apache-solr-1.4.1 release, I replaced all the files inside apache-solr-1.4.1/dist with the ones generated (in the dist directory) by running 'ant dist' on the patched 1.4.1 source code. I also replaced the release war inside example/webapps with the generated (patched) war; this last step was mandatory to avoid the NoSuchMethodError reported above.

Then I ran 'java -jar start.jar' from the example directory and everything worked.

Note that I used the latest version of pdfbox, jempbox and fontbox (1.2.1).

I can attach the patch to the 1.4.1 code that I used.

> Tika no longer properly extracts content in Solr
> ------------------------------------------------
>
>                 Key: SOLR-1902
>                 URL: https://issues.apache.org/jira/browse/SOLR-1902
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>             Fix For: 4.0
>
>
> See
> http://www.lucidimagination.com/search/document/2ca3fe953038a54f/problem_with_pdf_upgrading_cell#22360c8261801f24
> It appears that since the upgrade to Tika 0.7, Tika is now selecting an
> EmptyParser when uploading docs, which then outputs an empty XHTML
> representation. Still, it's strange that the tests pass.
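
The deployment steps described in the comment above could be sketched roughly as the shell function below. This is only an illustration, not the exact commands that were used: the function name install_patched_build, the two directory arguments, and the assumptions that the build leaves a single *.war in dist and that the example instance serves example/webapps/solr.war are all hypothetical.

```shell
#!/bin/sh
# Sketch only: directory layout and war file names are assumptions.

# Copy freshly built artifacts from a patched source tree into a release tree.
#   $1 = patched Solr 1.4.1 source directory (where 'ant dist' was run)
#   $2 = unpacked apache-solr-1.4.1 release directory
install_patched_build() {
    src=$1
    rel=$2

    # Replace everything in the release's dist directory with the patched build.
    rm -rf "$rel/dist"
    mkdir -p "$rel/dist" "$rel/example/webapps"
    cp -R "$src/dist/." "$rel/dist/"

    # Replace the war used by the example instance with the patched one.
    # Assumption: the first *.war found in dist is the Solr web application.
    for war in "$src"/dist/*.war; do
        if [ ! -f "$war" ]; then
            echo "no war found in $src/dist" >&2
            return 1
        fi
        cp "$war" "$rel/example/webapps/solr.war"
        break
    done
}

# Intended usage (not executed here; SRC and REL are placeholder variables):
#   (cd "$SRC" && ant dist)
#   install_patched_build "$SRC" "$REL"
#   (cd "$REL/example" && java -jar start.jar)
```

Replacing the war is a separate step because the example instance loads classes from the war under example/webapps, not from the jars in dist, which matches the note that skipping it leads to the NoSuchMethodError.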